Vanishing-Exploding Gradients and TBPTT
- Why is the big number $\large{\color{Purple}( \infty )}$ a problem?
- Why is the small number a problem?
- Gradient clipping for exploding gradients
- Solution for vanishing gradients
- Solution for expensive gradient computation: Truncated Back Propagation Through Time (TBPTT)
A Recurrent Neural Network is a type of ANN that contains loops, allowing information to be stored within the network. It is absolutely essential for sequence-like information (variable input size). A plain ANN, by contrast, has:
- No memory element.
- No dependence of the present data on the previous data.
In general, RNNs handle variably sized, sequential data: they combine an input vector with a state vector via a fixed function to produce a new state.
- Variably sized: the number of features is fixed; the length of the data is not.
- It looks very much like a feedforward neural network, except it also has connections pointing backward.
Weight matrix for a single hidden layer RNN | RNN- Many-To-Many | Total number of Layers = 'T'
We take a linear combination of
$\Huge{\color{Purple} h}$ and $\Huge{\color{Purple} x}$: there is a weight matrix $\Huge{\color{Purple} W}$ which we multiply by $\Huge{\color{Purple} h_{t-1}}$ of the previous layer, and some other weight matrix which we multiply by $\Huge{\color{Purple} x_{t}}$ of the same layer.
$\large{\color{Purple} g }$ need not be the same as $\large{\color{Purple} g^* }$; in fact $\large{\color{Purple} g^* }$ need not even be a non-linear function.
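Putting this together (using the W, U, V names introduced below for the three matrices, and writing b and c for the biases, which are labels added here for illustration), the two update rules presumably take this form:

$$\large{\color{Purple} h_t = g\big(W\,h_{t-1} + U\,x_t + b\big), \qquad \hat{y}_t = g^{*}\big(V\,h_t + c\big) }$$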
Step #1: Weight calculation for hidden layers and output layers
To train an RNN, the trick is to unroll it through time and then simply use regular backpropagation. This strategy is called backpropagation through time (BPTT).
Like regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows).
Note: plain backpropagation is the standard training procedure for an ANN or a Multi-Layer Perceptron.
In order to keep the expressions simple, we give aliases to the above two expressions, like
The matrices
$\large{\color{Purple} \textbf{W} }$, $\large{\color{Purple} \textbf{U} }$, $\large{\color{Purple} \textbf{V} }$ do not change with time (or across the layers); the same values are used at every time step. In an ANN, by contrast, each layer has its own weight matrix. These shared matrices still need to be updated during training; see Weight update in Backpropagation.
Answer: The following vectors do not change with time (or across the layers).
For the backpropagation we need to find the derivative of the loss function, let us say
the summation of all the intermediate losses through the layers.
- We assume that g is a non-linear function, and
- the loss function used is the least-squares error.
- We need to find out
[To be continued]
The gradients of that cost function are then propagated backward through the unrolled network (represented by the solid arrows).
- The model parameters are updated using the gradients computed during BPTT.
Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in the figure the cost function is computed using the last three outputs of the network, so gradients flow through those three outputs, but not through Y(0) and Y(1)).
- Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
- Fortunately, tf.keras takes care of all of this complexity for you
Hidden Layers:
Answer: We always take a linear combination followed by a non-linearity. In RNNs we typically use tanh as the non-linearity in the hidden layers.
- So in this case the non-linear function is tanh, and we need a linear combination of h and x: there is some weight matrix W which we multiply by h, and some other weight matrix which we multiply by x.
- Those two weight matrices are different in general. Not only that, they also have different sizes.
- So this is the general formula for the hidden layer of an RNN; some people will replace this tanh by fw or by g.
The calculation for hidden layer h2 would be-
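Spelling this out (a sketch using the $W_{hh}$, $W_{xh}$ and bias notation from the answer further below):

$$\large{\color{Purple} h_2 = \tanh\big(W_{hh}\,h_1 + W_{xh}\,x_2 + b_h\big), \qquad \text{where } h_1 = \tanh\big(W_{hh}\,h_0 + W_{xh}\,x_1 + b_h\big) }$$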
Answer:
Now in some cases it simply makes sense for this function to be a linear function (for a regression output). In some cases it makes sense for the function to be a non-linear function (for a classification output).
- If it is a classification task, and let us say it is a binary classification task, then g will become a Sigmoid σ.
- If it is a multiclass classification task, you will use a Softmax.
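In other words, the output step is a linear combination of $h_t$ followed by a task-dependent $g$ (identity for regression, σ for binary classification, softmax for multiclass); the $W_{hy}$, $b_y$ names below are just illustrative labels:

$$\large{\color{Purple} \hat{y}_t = g\big(W_{hy}\,h_t + b_y\big) }$$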
Answer: The weights and bias in h3 are the same $W_{hh}$, $W_{xh}$ and $b_h$ as for h2.
RNN- Many-To-Many | The number of Layers = 'T'
- Now when you have multiple predicted values: let us say that 10 days ago you have the weather $\large{\color{Purple} h_0}$, or the temperature $\large{\color{Purple} {x}_0}$, in some city, let us say Chennai.
- Given that input, you would have the next day's temperature, let us say that is $\large{\color{Purple} \hat{y_1}}$, the next day's temperature $\large{\color{Purple} \hat{y_2}}$, and the next day's temperature $\large{\color{Purple} \hat{y_3}}$, until, let us say, today's temperature, which is $\large{\color{Purple} \hat{y_T}}$.
- Now for each one of them, you also have a corresponding ground truth, which would be $\large{\color{Purple} y_1,\ y_2,\ y_3,\ \dots,\ y_T }$. And whenever you have a ground truth and a prediction and these two differ, you will have a loss.
- So the total loss is -
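Written out, the total loss described above is just the sum of the per-time-step losses:

$$\large{\color{Purple} L = \sum_{t=1}^{T} L_t\big(y_t,\ \hat{y}_t\big) }$$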
- Now in terms of $L_t$ itself, or the local loss function, you again have many choices, but we have seen only 2 so far:
  - cross entropy - classification
  - least-squares error - regression or a numerical output
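For reference, the two local losses mentioned above are typically written as follows (the ½ in the least-squares form is just a common convention):

$$\large{\color{Purple} L_t^{\text{CE}} = -\sum_{c} y_{t,c}\,\log \hat{y}_{t,c}, \qquad L_t^{\text{LS}} = \tfrac{1}{2}\,\lVert y_t - \hat{y}_t \rVert^2 }$$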
The main problem with RNNs is unstable gradients.
The basic reason we had to do BPTT is that the W, U, V matrices are constant across time, because of which you get recursive expressions for the derivative of the loss with respect to W and with respect to U.
The main issues that come up are
- The gradient calculations either explode or vanish; neither is ideal.
- The gradient calculations are expensive.
- The solution for exploding gradients is gradient clipping.
- The solution for vanishing gradients is alternate architectures LSTM, GRU.
- The solution for expensive gradient calculations is Truncated Back Propagation Through Time(TBPTT).
There are a few ways by which you can get an idea of whether your model is suffering from exploding gradients or not. They are:
- If the model weights become unexpectedly large in the end.
- Your model has a poor loss.
- Or the model displays NaN loss whilst training.
- The gradient value for error persists over 1.0 for every subsequent iteration during training.
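One simple way to watch for this in practice is to print the global gradient norm during training. The sketch below is a minimal, hypothetical tf.keras example (the model, data and sizes are made up purely for illustration):

```python
import tensorflow as tf

# Hypothetical toy model and batch, just to show how to inspect gradient norms.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),          # variable-length sequences, 1 feature
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()

x_batch = tf.random.normal((8, 50, 1))        # (batch, time steps, features)
y_batch = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    y_pred = model(x_batch, training=True)
    loss = loss_fn(y_batch, y_pred)

grads = tape.gradient(loss, model.trainable_variables)
global_norm = tf.linalg.global_norm(grads)    # one number summarising all gradients
print("loss:", float(loss), "| gradient norm:", float(global_norm))
# If this norm keeps growing across iterations (or the loss turns NaN),
# the gradients are most likely exploding.
```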
Link: Detect Vanishing Gradients
TBPTT works with both the **vanishing gradient** and the **exploding gradient** problems, so it is sort of a compound solution.
Now remember the example that we had before this, that example had 65,000 time sequences.
- Now, would you really forward propagate through the full thing and come back through the full thing? By that time, almost any correction you give will lead to vanishing or exploding gradient problems.
- Plus, it would become potentially very expensive just to do one gradient update.
TBPTT- forward propagate through 'k1' steps and back propagate through 'k2' steps
Let's consider the boxes here: each box represents an input, hidden layer and output with its corresponding loss. Remember that we are assuming the relationship is the same everywhere, and in fact you can cut the sequence anywhere in the middle and you are going to get exactly the same W.
The basic idea is that instead of training on the whole sequence, you split it up into many mini-batches.
- I forward propagate through 2 steps, then back propagate through 2 steps. The W remains the same everywhere, so I get some new updated W.
- Then I step forward a little bit: forward propagate through another 2 steps, back propagate through 2 steps; my W is now updated again.
- With the W updated this way, I keep moving forward through the whole sequence, repeating the process.
We saw that you cannot simply calculate the gradient in the usual way: let us say I have L3; I cannot simply compute its derivative with respect to W in the usual way, because h3 itself depends on W through h2 and h1.
The above is applicable for the derivative with respect to U as well.
This is basically what we call backpropagation through time, because none of these terms is independent. Now, this kind of dependency creates several problems.
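To make the dependency explicit, here is the standard chain-rule expansion for this case, written as a sketch in the W, h notation used above (the same pattern holds for U):

$$\large{\color{Purple} \frac{\partial L_3}{\partial W} \;=\; \sum_{k=1}^{3} \frac{\partial L_3}{\partial h_3}\, \left( \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W} }$$

Each factor $\frac{\partial h_j}{\partial h_{j-1}}$ contains the same W (and a tanh derivative), which is exactly what makes the terms interdependent.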
Heuristic Description.(rough)
Now, all of these are heuristic arguments, but they turn out to be remarkably good approximations; unfortunately I cannot go further here.
- But if I take the norm (let us say the 2-norm) $\| h_{t+n} \|$ - notice $\vec{h}_t$ is a vector.
- If I take its norm, it will be some factor times the norm of $h_t$.
- A norm is a scalar, so this is a number: you are trying to find the size of $h_{t+n}$, and that will be some number times the size of $h_t$.
- And it turns out that it scales approximately as the eigenvalues (λ) of $W^n$, like the following:

$$\large{\color{Purple} \| h_{t+n} \| \;\sim\; \lambda^n \, \| h_t \| }$$
Another way to see this is to assume that W is diagonal. If W is diagonal, then $W^n$ will simply have all its eigenvalues, i.e. all its diagonal terms, raised to the power n.
Now, which eigenvalue? We will see shortly.
The eigenvalue will either be the largest or the smallest:
- The worst-case scenario is when the eigenvalue is the largest.
- The best-case scenario (or the smallest case) is when the eigenvalue is the smallest.
Because of this scaling, as long as I use the same W throughout time, which I do for an RNN, these vectors constantly get larger in magnitude or constantly get smaller in magnitude.
So, if you have a large number of time steps, this factor, even if it is close to 1, for example 1.01, is over time going to grow into a huge number (this is the power of the exponential function, or of the power function).
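A quick numeric sanity check of this heuristic (plain Python, illustrative numbers only):

```python
# A factor only slightly above or below 1, raised to the number of time steps,
# explodes or vanishes very quickly.
for lam in (1.01, 0.99):
    for n in (10, 100, 1000, 10000):
        print(f"lambda = {lam}, n = {n:5d}, lambda**n = {lam ** n:.3e}")
# 1.01**10000 is astronomically large, while 0.99**10000 is essentially zero.
```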
- Now this is simply for h. You can show (and I would request you to try this out by looking at the expressions in BPTT) that similar arguments hold true for $\frac{\partial \mathrm{L_3}}{\partial \mathrm{W}}$ also.
Answer: This is a problem because obviously you are never going to get exactly ∞, since you are still dealing with finite numbers. But the problem is that the moment a value goes above the largest number your machine can represent, it will actually show you NaN (not a number), or it will show you ∞, and so on. So, really speaking, finite-precision machines cannot handle exploding gradients.
Answer: Similarly, you will never actually get to 0. Something like $0.99^{1000}$ will just be a very, very small number. But the problem is that it might actually become smaller than about $10^{-16}$, which is roughly the smallest number that you can represent accurately on a finite-precision machine. At that point you will no longer train; that is called saturation. So you get a very small gradient and the update is practically gone.
There is another problem: notice this tanh; even the tanh is being repeated multiple times. You have $h_t = \tanh(W h_{t-1})$ and $h_{t+1} = \tanh(W h_t)$, so effectively you have tanh applied twice (call it $\tanh^2$); similarly you will have $\tanh^3$, and so on.
$\tanh^2$ is flat; $\tanh^3$ will look even flatter.
Look at tanh and $\tanh^2$: $\tanh^2$ looks flatter. And if you take $\tanh^{100}$, it will look flatter still. Notice that in all these cases the gradients become flatter and flatter and hence very small.
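A small NumPy sketch of this flattening effect (the point at which the slope is measured is an arbitrary illustration choice):

```python
import numpy as np

# Compose tanh with itself k times and watch the output range shrink and the
# slope away from the origin drop: gradients through many tanh steps get small.
x = np.linspace(-3, 3, 601)
for k in (1, 2, 3, 10, 100):
    y = x.copy()
    for _ in range(k):
        y = np.tanh(y)
    slope = np.gradient(y, x)                 # numerical derivative of the composition
    i = np.searchsorted(x, 1.0)               # index of x close to 1
    print(f"tanh composed {k:3d} times: output range ±{y.max():.3f}, slope near x=1: {slope[i]:.3f}")
```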
Now, all these problems put together lead to these 2 issues. The repeated-tanh problem leads only to the vanishing gradient issue, but a large number of layers (time steps) can lead either to exploding gradients or to vanishing gradients; both of these make training very difficult.
Answer: It is very simple: we decide on a maximum allowable gradient size. What do I mean by the value of a gradient? A gradient is a vector, so you cannot give it a single value; you can, however, give a value to the norm of the gradient.
- Say-
- My new gradient
$\vec{g^*}$ is in the same direction as the gradient you calculated, but I am cutting down its size.
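In symbols, if $\lVert \vec{g} \rVert$ exceeds the chosen threshold, we rescale: $\vec{g}^{*} = \frac{\text{threshold}}{\lVert \vec{g} \rVert}\,\vec{g}$. A minimal sketch of this (the threshold value 1.0 is arbitrary); in tf.keras the same thing is available as the clipnorm argument on optimizers:

```python
import tensorflow as tf

# Clipping by norm: keep the gradient's direction, cap its length.
max_norm = 1.0                               # the maximum allowable gradient size we decided on
g = tf.constant([3.0, 4.0])                  # toy gradient with norm 5
g_clipped = tf.clip_by_norm(g, max_norm)     # same direction, norm cut down to 1
print(g_clipped.numpy())                     # approximately [0.6, 0.8]

# In tf.keras this is a one-liner on the optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=max_norm)
```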
Answer: Unfortunately, no such simple solution exists; alternate architectures such as LSTM and GRU are the result.
Answer: This solution handles, to a certain extent, both the vanishing gradient and the exploding gradient problems.
So, suppose we have data with thousands and thousands of time steps, and you want to calculate backpropagation through time. Now how would you do it?
Forward propagate through the whole thing and calculate the whole of the loss.
Then you will back propagate through the whole thing.
Now we have 65,000 time steps. If you go forward through the full thing and come back through the full thing, by that time almost any correction you give will lead to vanishing or exploding gradient problems; plus, it would become potentially very expensive just to do one gradient update.
So the solution to that is truncated back propagation through time.
Since throughout the RNN you use exactly the same W, instead of training on the whole sequence you split it up into many mini-batches (similar to mini-batch gradient descent).
I will forward propagate through the first 2 steps and back propagate through 2 steps; this is one possibility. Since W is the same everywhere, I will get some new updated W.
Next I forward propagate through another 2 steps and back propagate through 2 steps; my W is now updated again. I keep on doing this through the whole sequence.
If you back propagate through only a small amount of data, your gradient will neither blow up nor vanish. Now, what is a good rule of thumb? It is actually hard to say: for some problems a hundred steps are good; for some problems 10 or 20 steps are good, etc.
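Below is a minimal sketch of this idea in tf.keras (the chunk length, layer sizes and toy data are all made-up illustration values); the key points are carrying the hidden state across chunks while cutting the gradient at each chunk boundary with tf.stop_gradient:

```python
import tensorflow as tf

k = 20                                          # truncation length (k1 = k2 = 20 here)
cell = tf.keras.layers.SimpleRNNCell(16)
readout = tf.keras.layers.Dense(1)
optimizer = tf.keras.optimizers.Adam(1e-3)

# Toy long series: 1 sequence, 1000 time steps, 1 feature, with per-step targets.
series = tf.random.normal((1, 1000, 1))
targets = tf.random.normal((1, 1000, 1))

state = [tf.zeros((1, 16))]                     # initial hidden state
for start in range(0, 1000, k):
    x_chunk = series[:, start:start + k, :]
    y_chunk = targets[:, start:start + k, :]
    with tf.GradientTape() as tape:
        loss = 0.0
        s = [tf.stop_gradient(state[0])]        # truncate: no gradient flows into earlier chunks
        for t in range(x_chunk.shape[1]):
            out, s = cell(x_chunk[:, t, :], s)  # forward propagate one step
            loss += tf.reduce_mean((readout(out) - y_chunk[:, t, :]) ** 2)
    variables = cell.trainable_variables + readout.trainable_variables
    grads = tape.gradient(loss, variables)      # back propagate through this chunk only
    optimizer.apply_gradients(zip(grads, variables))
    state = s                                   # carry the hidden state into the next chunk
```

In practice, the same effect is usually obtained simply by cutting the data into fixed-length windows before feeding it to an RNN layer (optionally with stateful=True to carry the state across batches).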
The deep RNNs are particularly important in language processing especially in language translation.
Now, what are deep RNNs? Let us look at just one of these: if I look at one of these boxes within the RNN, it is just an ANN, as we saw with normal RNNs. In a normal RNN all you had was one input layer, one hidden layer and one output layer. In a deep RNN, that single layer of the RNN becomes a deep neural network; that is the only difference between a deep RNN and a normal RNN.
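A minimal tf.keras sketch of the difference (layer widths are arbitrary illustration values): a normal RNN has one recurrent layer, a deep RNN just stacks several, with return_sequences=True on every recurrent layer except the last so each layer hands a full sequence to the next.

```python
import tensorflow as tf

deep_rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),                    # variable-length sequences, 1 feature
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.SimpleRNN(32),                      # last recurrent layer: final state only
    tf.keras.layers.Dense(1),
])
deep_rnn.summary()
```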
Many-to-Many (top left), Many-to-One (top right), One-to-Many (bottom left), and Encoder–Decoder (bottom right) networks
- An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs.
- Example #1: Language Translation, Speech Recognition
- The output does not come out at the same time as the input, and the size of the output need not be the same as the input.
- Example #2: Video frame-by-frame analysis
- The output size is fixed by the input size.
- You could feed the network a sequence of inputs and ignore all outputs except for the last one.
- For example: Sentiment Analysis: you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score (a minimal sketch appears after this list).
- Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence.
- For example: Image Captioning- the input could be an image (or the output of a CNN), and the output could be a caption(text) for that image.
- Lastly, you could have a sequence-to-vector network, called an *encoder*, followed by a vector-to-sequence network, called a *decoder*.
- For example, this could be used for translating a sentence from one language to another.
- You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language.
- This two-step model, called an *Encoder–Decoder*, works much better than trying to translate on the fly with a single *sequence-to-sequence RNN* (like the one represented at the top left): the last words of a sentence can affect the first words of the translation, so you need to wait until you have seen the whole sentence before translating it.
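As promised in the Many-to-One item above, here is a minimal sentiment-analysis-style sketch in tf.keras (the vocabulary size and layer widths are made-up illustration values): a sequence of word ids goes in, a single sentiment score comes out.

```python
import tensorflow as tf

vocab_size = 10_000                                      # assumed vocabulary size
many_to_one = tf.keras.Sequential([
    tf.keras.Input(shape=(None,)),                       # variable-length sequence of word ids
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.SimpleRNN(32),                       # only the final hidden state is kept
    tf.keras.layers.Dense(1, activation="sigmoid"),      # sentiment score in [0, 1]
])
many_to_one.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```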
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Géron
- NPTEL(Dr. Balaji Srinivasan)
- YouTube 1