Index


Recurrent Neural Network (RNN)

Variants of Recurrent Neural Network

Recurrent Neural Network (RNN)

A Recurrent Neural Network (RNN) is a type of ANN that contains loops, allowing information to be stored within the network. It is essential for sequential data, where the input size can vary.

Why Recurrent Neural Networks(RNN)?


$$ {\color{Purple} \boxed{ \large\begin{align*} & \textbf{RNN} && \textbf{ANN and CNN}\\ & \textrm{Variable length Input.}&& \textrm{Fixed sized inputs.}\\ & \textrm{Sequential Data or time series data.}&& \textrm{The whole input available simultaneously.}\\ & \textrm{E.g. Text data, where context matters.}&& \\ & \textrm{Used in- Speech processing,}&& \textrm{Used in- other deep learning work.}\\ & \textrm{Language Translation, video analysis.}&& \\ \end{align*} } } $$

Drawbacks of CNN/ANN

  • No memory element.
  • The present data is treated as independent of the previous data.

RNN Layers

In general, an RNN handles variably sized, sequential data by combining an input vector with a state vector via a fixed function to produce a new state.

  • Variably sized: the number of features is fixed, but the length of the sequence is not.
  • It looks very much like a feedforward neural network, except it also has connections pointing backward.

An RNN incorporates the idea of an equally spaced, repetitive, temporal relationship.


RNN unrolling over time

Backpropagation Through Time (BPTT)



Weight matrix for a single hidden layer RNN | RNN- Many-To-Many | Total number of Layers = 'T'

For the hidden layer, for any $\large{\color{Purple} h_t}$ -

$$ \Huge{\color{Purple} \begin{align*} h_t = g (W_{hh}h_{t-1}+W_{xh} x_t + \textrm{b}) & & \normalsize \begin{cases} g = \textit{non-linear function} \\ W_{{\color{Cyan}hh}} = \textit{ takes an \textbf{h} and gives out an \textbf{h}} \\ W_{{\color{Cyan}xh}} = \textit{ takes an \textbf{x} and gives out an \textbf{h}}\\ W_{{\color{Cyan}hh}}, W_{{\color{Cyan}xh}} , \textbf{b} = \textit{ are constant with time} \end{cases} \end{align*} } $$

Description

The hidden state is built from a linear combination of $\large{\color{Purple} h_{t-1}}$ and $\large{\color{Purple} x_t}$: one weight matrix, $\large{\color{Purple} W_{hh}}$, multiplies $\large{\color{Purple} h_{t-1}}$ from the previous time step, and another weight matrix, $\large{\color{Purple} W_{xh}}$, multiplies $\large{\color{Purple} x_{t}}$ from the current time step.

For Output layer for any instance $\large{\color{Purple} \hat{y}_t}$ -

$$ \Huge{\color{Purple} \hat{y_t} = g^* ( W_{yh} h_t + b)} $$

Description

$\large{\color{Purple} g }$ need not be the same as $\large{\color{Purple} g^* }$, and $\large{\color{Purple} g^* }$ need not even be a non-linear function.
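The hidden-layer and output-layer equations above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes $g$ is tanh, $g^*$ is the identity, the bias sits inside the non-linearity, and the output layer has its own bias `b_y`. The function name `rnn_forward` and all sizes are made up for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, b, W_yh, b_y):
    """Run a single-hidden-layer RNN over a sequence (many-to-many)."""
    h = h0
    hs, ys = [], []
    for x_t in xs:                                  # one iteration per time step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)      # g = tanh (assumed)
        y_hat = W_yh @ h + b_y                      # g* = identity (assumed)
        hs.append(h)
        ys.append(y_hat)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, n_x, n_h, n_y = 5, 3, 4, 2                       # hypothetical sizes
xs = rng.normal(size=(T, n_x))
W_hh = rng.normal(size=(n_h, n_h)) * 0.1            # takes an h, gives out an h
W_xh = rng.normal(size=(n_h, n_x)) * 0.1            # takes an x, gives out an h
W_yh = rng.normal(size=(n_y, n_h)) * 0.1            # takes an h, gives out a y
b, b_y = np.zeros(n_h), np.zeros(n_y)

hs, ys = rnn_forward(xs, np.zeros(n_h), W_hh, W_xh, b, W_yh, b_y)
print(hs.shape, ys.shape)
```

The same three matrices are reused in every iteration of the loop, which is what "constant with time" means in practice.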

Step #1: Weight calculation for hidden layers and output layers

To train an RNN, the trick is to unroll it through time and then simply use regular backpropagation. This strategy is called backpropagation through time (BPTT).

Like regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows).

Note: Backpropagation is common in ANN or in Multi-Layer Perceptron.

To keep the expressions simple, we alias the weight matrices in the two expressions above: $\large{\color{Purple} W_{hh} \to W}$, $\large{\color{Purple} W_{xh} \to U}$ and $\large{\color{Purple} W_{yh} \to V}$.

Now $\large{\color{Purple} h_t}$ and $\large{\color{Purple} \hat{y}_t}$ look like-

$$ {\color{Red} \boxed{\Huge{\color{Purple} \begin{align*} & h_t = g (W h_{t-1}+ U x_t + \textrm{b}) \\ & \hat{y_t} = g^* ( V h_t) \\ \end{align*}} }} $$

Description

The matrices $\large{\color{Purple} \textbf{W} }$, $\large{\color{Purple} \textbf{U} }$, $\large{\color{Purple} \textbf{V} }$ do not change with time (that is, across the unrolled layers): the same values are used at every time step. In an ANN, by contrast, each layer has its own weight matrix. The shared matrices still need to be updated during training. See the weight update in backpropagation.


⚛️ How does RNN keep the context?

Answer: The following vectors do not change with time (or across the layers).


Step #2: Loss calculation

For backpropagation we need to find the derivative of the loss function, say $\large{\color{Purple} L }$, with respect to each of the matrices $\large{\color{Purple} W }$, $\large{\color{Purple} U }$, $\large{\color{Purple} V}$ -

$$ \Huge {\color{Purple} \frac{\partial \textrm{L}}{\partial \textrm{W}},\ \frac{\partial \textrm{L}}{\partial \textrm{U}},\ \frac{\partial \textrm{L}}{\partial \textrm{V}}} {\color{Purple} \Big \{ \normalsize \textrm{For the backprop we need to find the gradient of 'L' with respect to each of the matrices} } $$

Recap of Loss function and $\large{\color{Purple}\partial L }$

$$ \Huge{\color{Purple} \begin{align*} \textbf{L} = \sum_{t=1}^{\textrm{T}} \textbf{L}_{t} & & \normalsize \begin{cases} \textrm{where } T \textrm{ is the max time step} \\ \textrm{Summation of all the intermediate losses through the layers}\\ \end{cases} \end{align*}} $$

Description

Summation of all the intermediate losses through the layers.


Example #1: Let's consider the local loss $L_3$ and see how backprop works

$$ \Huge {\color{Purple} \frac{\partial \textrm{L}_3}{\partial \textrm{W}},\ \frac{\partial \textrm{L}_3}{\partial \textrm{U}},\ \frac{\partial \textrm{L}_3}{\partial \textrm{V}} } $$


  • We assume that g is a non-linear function and-

$$ \Huge {\color{Purple} \hat{y_3} = g(V h_3) } $$

  • The loss function used is the least-squares error

$$ \Huge {\color{Purple} \mathrm{L_3} = \frac{1}{2}(y_3 - \hat{y_3})^2 } $$


  • We need to find

$$ \Huge {\color{Purple} \begin{align*} & \frac{\partial \textrm{L}_3}{\partial \textrm{V}} & {\color{Black} \large \textrm{which can be expressed by- }} \\ & \frac{\partial \textrm{L}_3} {\partial \textrm{V}} = \frac{\partial \textrm{L}_3}{\partial \mathrm{\hat{y_3}}} \frac{\partial \mathrm{\hat{y_3}}}{\partial \textrm{V}} &\\ & \frac{\partial \textrm{L}_3} {\partial \textrm{V}} = - (\mathrm{y_3 - \hat{y_3}}) \ g'(\mathrm{V h_3}) \ \mathrm{h_3} &\\ \end{align*} } $$

Here $g'$ is the derivative of $g$. When $g$ is the identity (a linear output), $g' = 1$ and the gradient reduces to $-(y_3 - \hat{y_3}) h_3$.

[To be continued]
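The expression for $\frac{\partial L_3}{\partial V}$ can be checked numerically. The sketch below works in the scalar case, assumes $g$ is tanh (the notes only say it is non-linear), and uses made-up values for $V$, $h_3$ and $y_3$. It compares the chain-rule gradient, including the $g'$ factor that a non-linear $g$ contributes, with a central finite difference.

```python
import numpy as np

def loss3(V, h3, y3):
    y_hat = np.tanh(V * h3)            # y_hat_3 = g(V h_3), with g = tanh (assumed)
    return 0.5 * (y3 - y_hat) ** 2     # least-squares local loss L_3

def dloss3_dV(V, h3, y3):
    y_hat = np.tanh(V * h3)
    g_prime = 1.0 - y_hat ** 2         # derivative of tanh at V h_3
    return -(y3 - y_hat) * g_prime * h3

V, h3, y3 = 0.7, 0.4, 1.0              # illustrative values only
eps = 1e-6
numeric = (loss3(V + eps, h3, y3) - loss3(V - eps, h3, y3)) / (2 * eps)
analytic = dloss3_dV(V, h3, y3)
print(analytic, numeric)               # the two values should agree closely
```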


⚛️ What is $\large \frac{\partial \textrm{L}_3}{\partial \textrm{W}}$ ?

$$ \Huge {\color{Purple} \begin{align*} & \frac{\partial \textrm{L}_3} {\partial \textrm{W}} = \frac{\partial \textrm{L}_3}{\partial \mathrm{\hat{y_3}}} \frac{\partial \mathrm{\hat{y_3}}}{\partial \textrm{h}_3} \frac{\partial \textrm{h}_3}{\partial \textrm{W}}&\\ & \frac{\partial \textrm{L}_3} {\partial \textrm{W}} = - (\mathrm{y_3 - \hat{y_3}}) \ g'(\mathrm{V h_3}) \ \mathrm{V} \ \frac{\partial \textrm{h}_3}{\partial \textrm{W}}&\\ \end{align*} } $$

[To be continued]

Step #3:

The gradients of that cost function are then propagated backward through the unrolled network (represented by the solid arrows).

Finally

  • The model parameters are updated using the gradients computed during BPTT.

Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in the figure the cost function is computed using the last three outputs of the network, Y(2), Y(3), and Y(4), so gradients flow through these three outputs, but not through Y(0) and Y(1)).

  • Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
  • Fortunately, tf.keras takes care of all of this complexity for you.
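To make the "sum over all time steps" concrete, here is a hand-written BPTT sketch for a scalar RNN. It assumes $h_t = \tanh(W h_{t-1} + U x_t)$, a linear output $\hat{y}_t = V h_t$, least-squares local losses, and no bias. The function names and the toy data are hypothetical. The result is checked against a finite-difference estimate rather than taken on trust.

```python
import numpy as np

def forward(W, U, V, xs, ys, h0=0.0):
    """Unroll the RNN; return all hidden states and the total loss."""
    hs, loss = [h0], 0.0
    for x, y in zip(xs, ys):
        hs.append(np.tanh(W * hs[-1] + U * x))   # h_t = tanh(W h_{t-1} + U x_t)
        loss += 0.5 * (y - V * hs[-1]) ** 2      # L_t with y_hat_t = V h_t
    return hs, loss

def bptt(W, U, V, xs, ys):
    """Gradients of the total loss with respect to W, U, V."""
    hs, _ = forward(W, U, V, xs, ys)
    dW = dU = dV = 0.0
    dh_next = 0.0                        # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        dy = V * h_t - ys[t]             # dL_t / dy_hat_t
        dV += dy * h_t
        dh = dy * V + dh_next            # total gradient reaching h_t
        da = dh * (1.0 - h_t ** 2)       # back through tanh
        dW += da * h_prev                # contributions summed over all t
        dU += da * xs[t]
        dh_next = da * W                 # pass the gradient on to step t-1
    return dW, dU, dV

xs = [0.5, -0.3, 0.8, 0.1]               # toy inputs
ys = [0.2, 0.1, -0.4, 0.3]               # toy targets
W, U, V = 0.9, 0.5, 1.2
dW, dU, dV = bptt(W, U, V, xs, ys)

eps = 1e-6
numeric_dW = (forward(W + eps, U, V, xs, ys)[1]
              - forward(W - eps, U, V, xs, ys)[1]) / (2 * eps)
print(dW, numeric_dW)                    # should agree closely
```

Because `W` appears at every time step, `dW` accumulates one term per step inside the backward loop.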


Hidden Layers:


⚛️ What is the most general function we typically use within neural network?

Answer: We always take a linear combination followed by a non-linearity. In RNNs we typically use tanh for the non-linearity in the hidden layers.

$$\Huge{\color{Purple} \begin{align*} & \ h_t = f_w (h_{t-1}, x_t) &\\ & \Huge \boxed{h_t = tanh (W_{{\color{Cyan}hh}}h_{t-1} + W_{{\color{Cyan}xh}} x_t + \textrm{b})} & \normalsize \begin{cases} W_{{\color{Cyan}hh}} = \textit{ takes an \textbf{h} and gives out an \textbf{h}} \\ W_{{\color{Cyan}xh}} = \textit{ takes an \textbf{x} and gives out an \textbf{h}}\\ W_{{\color{Cyan}hh}}, W_{{\color{Cyan}xh}} , \textbf{b} = \textit{ are constant with time} \end{cases} \end{align*} } $$

  • So in this case, the non-linear function is tanh, applied to a linear combination of h and x: one weight matrix, $W_{hh}$, multiplies h and another weight matrix, $W_{xh}$, multiplies x.
  • Those two weight matrices are different in general. Not only that, they also have different sizes.
  • So this is the general formula for the hidden layer of an RNN; some people replace tanh with $f_w$ or $g$.

The calculation for hidden layer $h_2$ would be-

$$ \large{\color{Purple} h_2 = tanh (W_{hh}h_{1} + W_{xh} x_2 + \textrm{b})} $$

Output Layer:

The output size is variable.

⚛️ What about this $\large{\color{Purple} \hat{y}_t}$ ?

Answer: $\large{\color{Purple} \hat{y}_t}$ is equal to some function of $\large{\color{Purple} h_t}$.

In some cases it makes sense for this function to be linear (for a regression output). In other cases it makes sense for the function to be non-linear (for a classification output).

  • If it is a binary classification task, the output function will be a sigmoid σ.
  • If it is a multiclass classification task, you will use a softmax.
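The three output choices just listed can be sketched as follows. The helper `output_layer` and its `task` argument are hypothetical names introduced for illustration, and the softmax subtracts the maximum before exponentiating, a common numerical-stability step that the notes do not mention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def output_layer(h_t, W_yh, b_y, task):
    """Map a hidden state h_t to a prediction y_hat_t."""
    z = W_yh @ h_t + b_y               # linear combination of the hidden state
    if task == "regression":
        return z                       # linear output
    if task == "binary":
        return sigmoid(z)              # probability of the positive class
    if task == "multiclass":
        return softmax(z)              # probabilities that sum to 1
    raise ValueError(f"unknown task: {task}")

rng = np.random.default_rng(0)
h_t = rng.normal(size=4)
W_yh, b_y = rng.normal(size=(3, 4)), np.zeros(3)
probs = output_layer(h_t, W_yh, b_y, "multiclass")
print(probs, probs.sum())
```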

⚛️ Constant with time meaning

Answer: The weights and bias used to compute $h_3$ are the same $W_{hh}$, $W_{xh}$ and $b$ used to compute $h_2$.

⚛️ Calculating Loss in RNN



RNN- Many-To-Many | The number of Layers = 'T'

Example:

  • Suppose you have multiple predicted values. Say that 10 days ago you start from the state $\large{\color{Purple} h_0}$ and the temperature $\large{\color{Purple} {x}_0}$ in some city, let us say Chennai.
  • Given that input, you predict the next day's temperature, $\large{\color{Purple} \hat{y_1}}$, then the following day's temperature, $\large{\color{Purple} \hat{y_2}}$, then $\large{\color{Purple} \hat{y_3}}$, and so on up to today's temperature, $\large{\color{Purple} \hat{y_T}}$.
  • For each of them you also have a corresponding ground truth, $\large{\color{Purple} y_1,\ y_2,\ y_3, \dots, \ y_T }$. Whenever a ground truth and a prediction differ, you have a loss.
  • So the total loss is -

$$ \Huge{\color{Purple} \begin{align*} & \boxed{ \textbf{L} = \sum_{t=1}^{\textrm{T}} \textbf{L}_{t} } & \Big \{ \normalsize \textit{ Summation of all the intermediate losses through the layers} \\ \end{align*} } $$

  • For $L_t$ itself, the local loss function, you again have many choices, but we have seen only two so far:
    1. cross entropy - classification
    2. least-squares error - regression or a numerical output
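The total loss and the two local losses can be written out as a short sketch. The function names are illustrative, the temperatures are made-up numbers, and the small `eps` inside the logarithm is an added safeguard against `log(0)` that is not part of the notes.

```python
import numpy as np

def squared_error(y, y_hat):
    """Least-squares local loss, for regression outputs."""
    return 0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

def cross_entropy(y_onehot, p, eps=1e-12):
    """Cross-entropy local loss, for classification outputs."""
    return -np.sum(np.asarray(y_onehot) * np.log(np.asarray(p) + eps))

def total_loss(ys, y_hats, local_loss):
    """L = sum over t of L_t."""
    return sum(local_loss(y, y_hat) for y, y_hat in zip(ys, y_hats))

# Made-up daily temperatures (ground truth) and predictions.
ys = [30.0, 31.0, 29.5]
y_hats = [29.0, 31.0, 30.5]
L = total_loss(ys, y_hats, squared_error)
print(L)
```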


Problems in Training Simple RNNs

The main problem with RNNs is unstable gradients.

The basic reason we had to do BPTT is that the W, U, V matrices are constant across time. Because of this, the gradients of the loss with respect to W and with respect to U are recursive expressions.

The main issues that come up are

  • gradient calculations either explode or vanish, both of these are not ideal.
  • The gradient calculations are expensive.

The Solution

  • The solution for exploding gradients is gradient clipping.
  • The solution for vanishing gradients is alternative architectures such as LSTM and GRU.
  • The solution for expensive gradient calculations is Truncated Back Propagation Through Time (TBPTT).

Vanishing-Exploding Gradients and TBPTT


⚫ How to identify exploding gradients?

There are a few signs that your model may be suffering from exploding gradients:

  1. The model weights become unexpectedly large by the end of training.
  2. The model has a poor loss.
  3. The model displays a NaN loss during training.
  4. The error gradient stays above 1.0 at every subsequent iteration during training.

⚫ How to identify vanishing gradients?

Link Detect Vanishing Gradients ↗️


Truncated Back Propagation Through Time(TBPTT)


TBPTT helps with both the **vanishing gradient** and the exploding gradient problems, so it is sort of a compound solution.

Now remember the example that we had before this; that example had 65,000 time steps.

  • If you go forward through the full sequence and come back through the full sequence, almost any correction you compute will suffer from vanishing or exploding gradients.
  • Plus it would be potentially very expensive to do just one gradient update.


TBPTT- forward propagate through 'k1' steps and back propagate through 'k2' steps

Let's consider the boxes here; each box represents an input, hidden layer and output with its corresponding loss. Remember that we are assuming the relationship is the same at every step, so you can cut the sequence anywhere in the middle and you still have exactly the same W.

The basic idea is that instead of training on the whole sequence, you split it up into many mini-batches.

  • I forward propagate through 2 steps, then back propagate through 2 steps. Since W is the same everywhere, I get a new, updated W.
  • Then I step forward a little, forward propagate through another 2 steps and back propagate through 2 steps; W is updated again.
  • Once W is updated, I continue forward propagating through the sequence, and I keep on doing this.

All you are doing is forward propagating through one part of the data and back propagating through a limited part of the data. How does this help? If you back propagate through only a small number of steps, the gradient is much less likely to blow up or vanish.
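The procedure described above can be sketched for a scalar RNN. This is a simplified illustration under several assumptions: forward and backward windows have the same length `k` (that is, k1 = k2), the model is $h_t = \tanh(W h_{t-1} + U x_t)$ with a linear output $V h_t$ and least-squares losses, and the hidden state carried across a chunk boundary is treated as a constant. The function name, learning rate and toy sine data are all hypothetical.

```python
import numpy as np

def tbptt_train(xs, ys, W, U, V, k=2, lr=0.05):
    """One pass over the sequence, updating the weights after every k steps."""
    h = 0.0
    for start in range(0, len(xs), k):
        cx, cy = xs[start:start + k], ys[start:start + k]

        # Forward propagate through this chunk only.
        hs = [h]
        for x in cx:
            hs.append(np.tanh(W * hs[-1] + U * x))

        # Back propagate through the same chunk only.
        dW = dU = dV = 0.0
        dh_next = 0.0
        for t in reversed(range(len(cx))):
            h_t, h_prev = hs[t + 1], hs[t]
            dy = V * h_t - cy[t]
            dV += dy * h_t
            da = (dy * V + dh_next) * (1.0 - h_t ** 2)
            dW += da * h_prev
            dU += da * cx[t]
            dh_next = da * W

        # The same W, U, V are used everywhere, so update them right away.
        W, U, V = W - lr * dW, U - lr * dU, V - lr * dV
        h = hs[-1]      # carry the state forward; no gradient crosses the boundary
    return W, U, V

steps = np.arange(20)
xs = np.sin(0.3 * steps)            # toy input sequence
ys = np.sin(0.3 * (steps + 1))      # toy target: the next value
W, U, V = tbptt_train(xs, ys, W=0.5, U=0.5, V=0.5, k=2)
print(W, U, V)
```

Each backward pass multiplies by `W` at most `k - 1` times, which is why short windows keep the gradient from growing or shrinking without bound.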


Recap of BPTT


Typical RNN

Total Loss-

$$ \Huge{\color{Purple} \begin{align*} \mathrm{L}=\sum_{t=1}^{\mathrm{T}} \mathrm{L}_t \end{align*}} $$

⚛️ How W is updated when we use simple gradient descent

$$ \Huge{\color{Purple} \begin{align*} & \mathrm{W}= \mathrm{W} - \alpha \frac{\partial\mathrm{L}}{\partial\mathrm{W}} \\ & \large \frac{\partial\mathrm{L}}{\partial\mathrm{W}} \Big \} \textrm{This term has to be calculated as: } \sum_{t=1}^{T} \frac{\partial \mathrm{L}_t}{\partial\mathrm{W}} \\ \end{align*}} $$

We saw that, for a local loss such as $L_3$, you cannot simply calculate the gradient in the usual way, because

$$\Huge{\color{Purple} \begin{align*} &\frac{\partial\mathrm{L}_3}{\partial\mathrm{W}} \to \frac{\partial\mathrm{h}_3}{\partial\mathrm{W}}\to \frac{\partial\mathrm{h}_2}{\partial\mathrm{W}}\to \frac{\partial\mathrm{h}_1}{\partial\mathrm{W}}\\ & \large \textrm{They involve each other} \\ \end{align*}} $$

The same applies to U as well

$$\Huge{\color{Purple} \begin{align*} &\frac{\partial\mathrm{L}_3}{\partial\mathrm{U}} \to \frac{\partial\mathrm{h}_3}{\partial\mathrm{U}}\to \frac{\partial\mathrm{h}_2}{\partial\mathrm{U}}\to \frac{\partial\mathrm{h}_1}{\partial\mathrm{U}}\\ & \large \textrm{They involve each other} \\ \end{align*}} $$

This is basically what we call back propagation through time, because none of these terms is independent. Now this kind of dependency creates several problems.

⚛️ Explanation: why BPTT is a problem


Heuristic description (rough)

$$\Huge{\color{Purple} \begin{align*} & h_t = tanh(W h_{t-1} + U x_t) \\ & \large \textrm{by dropping } U x_t \textrm{ we get} \\ & h_t \sim tanh(W h_{t-1}) \\ & h_t \sim W h_{t-1} \large\textrm{ (ignoring the tanh), then}\\ & h_{t+1} \sim W^2 h_{t-1} \large \ [\textrm{because } h_{t+1} \sim W h_t \sim W^2 h_{t-1}] \\ & \large \textrm{In general- }\\ & \boxed{h_{t+n} \sim W^{n} h_{t} }\\ \end{align*}} $$

So, as you go through time, the weight matrix keeps on multiplying. $h_3$ would be like $W^2 h_1$, $h_5$ would be like $W^4 h_1$, and so on.

All of these are heuristic arguments, but they turn out to be remarkably good approximations; unfortunately I cannot go further here.

  • Take a norm (let us say the 2-norm) $\lVert h_{t+n} \rVert$; notice that $h_t$ is a vector.
  • Its norm will be some factor times the norm of $h_t$ ( $\lVert h_{t+n} \rVert \sim \lVert h_t \rVert$ ).
  • A norm is a scalar, so this is a number: you are measuring the size of $h_{t+n}$, which will be some number times the size of $h_t$.
  • It turns out that this factor scales approximately as $\lambda^n$, where $\lambda$ is an eigenvalue of W (the eigenvalues of $W^n$ are $\lambda^n$). Like the following-

$$\Huge{\color{Purple} \begin{align*} & {\color{Cyan}\vec{{\color{Purple}h_{t+n}}}} \sim W^n {\color{Cyan}\vec{{\color{Purple}h_t}}} \\ & {\parallel \mathrm{h_{t+n}} \parallel }_2 \sim \lambda^n \parallel \mathrm{h_t}\parallel \\ \end{align*} } $$

  • Another way to see this is to assume that W is diagonal. If W is diagonal, $W^n$ is also diagonal, with each diagonal term (each eigenvalue) raised to the power n.

  • Which eigenvalue matters? We will see shortly.

  • The relevant eigenvalue will be either the largest or the smallest.

    • The worst-case scenario is governed by the largest eigenvalue.
    • The best-case scenario is governed by the smallest eigenvalue.
  • Because of this scaling, as long as I use the same W throughout time, which I do for an RNN, these vectors constantly get larger in magnitude or constantly get smaller in magnitude.

  • So, if you have a large number of time steps, even a factor only slightly larger than 1, for example 1.01, grows into a huge number over time (this is the power of the exponential function, or of the power function).

$$\large{\color{Purple} \begin{align*} \textrm{For every 'h' }& & & \\ & \textrm{If } \huge{\mathrm{\lvert \lambda \rvert > 1}} & \textrm{ As 'n' increases }& \mathrm{\parallel h_{t+n}\parallel \to \infty }& \textrm{ (Become very large)}\\ & \textrm{If } \huge{\mathrm{\lvert \lambda \rvert < 1}} & \textrm{ As 'n' increases }& \mathrm{\parallel h_{t+n}\parallel \to 0 }\\ \end{align*} } $$
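A quick numerical experiment illustrates the two cases. It follows the heuristic above: the tanh and the input term are dropped, so the state is simply multiplied by W at every step. The diagonal matrices and the helper name `norm_after` are chosen for the example, with largest eigenvalue 1.01 in one case and 0.99 in the other.

```python
import numpy as np

def norm_after(W, h, n):
    """2-norm of W^n h, computed by repeated multiplication."""
    for _ in range(n):
        h = W @ h
    return np.linalg.norm(h)

h = np.array([1.0, 1.0])
W_big = np.diag([1.01, 0.5])      # largest |eigenvalue| slightly above 1
W_small = np.diag([0.99, 0.5])    # largest |eigenvalue| slightly below 1

grown = norm_after(W_big, h, 1000)      # on the order of 1.01**1000
shrunk = norm_after(W_small, h, 1000)   # on the order of 0.99**1000
print(grown, shrunk)
```

After 1000 steps the first norm has grown by roughly four orders of magnitude and the second has shrunk by roughly four, even though both eigenvalues differ from 1 by only 0.01.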

  • This argument is for h. By looking at the expressions in BPTT, you can show (and I would request you to try this out) that similar arguments hold for $\frac{\partial \mathrm{L_3}}{\partial \mathrm{W}}$ as well.

$$\large{\color{Purple} \begin{align*} \boxed{\frac{\partial \mathrm{L_3}}{\partial \mathrm{W}} \to \frac{\partial \mathrm{h_3}}{\partial \mathrm{W} } } \to W \frac{\partial \mathrm{h_2}}{\partial \mathrm{W}}\\ \end{align*} } $$

$$\Huge{\color{Purple} \begin{align*} & \parallel \frac{\partial \textrm{L}}{\partial \textrm{W}} \parallel \to \infty & \large \textbf{Exploding Gradient} \\ & \parallel \frac{\partial \textrm{L}}{\partial \textrm{W}} \parallel \to 0 & \large \textbf{Vanishing Gradient} \\ \end{align*} \left \} \begin{matrix} \\ \large \textrm{Very Difficult to train}\\ \\ \end{matrix}\right. } $$

⚛️ Why is a big number a problem, given that ∞ just means a very big number?

Answer: You are never going to reach exactly ∞, because you are still dealing with finite numbers. The problem is that the moment a value exceeds the largest number your machine can represent, it will show you NaN (not a number) or inf. So, really speaking, finite-precision machines cannot handle exploding gradients.

⚛️ Why is a small number a problem?

Answer: Similarly, you will never actually reach 0. Something like $0.99^{1000}$ is a very small number, but it is still not 0. The problem is that the gradient can become smaller than about $10^{-16}$, which is roughly the precision limit of double-precision floating point: added to a weight of order 1, an update that small changes nothing. At that point you no longer train; this is called saturation. So you get a very small gradient, and that is practically gone.
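Both failure modes are easy to reproduce with double-precision floats. The sketch below uses NumPy's `float64` and suppresses the overflow warnings so the output stays readable; the specific constants are only illustrative.

```python
import numpy as np

big = np.float64(1e308)                       # close to the largest float64
with np.errstate(over="ignore", invalid="ignore"):
    overflow = big * 10.0                     # too large to represent: inf
    not_a_number = overflow - overflow        # inf - inf is undefined: nan

weight = 1.0
tiny_update = 1e-17                           # below float64 precision near 1
unchanged = (weight + tiny_update) == weight  # the update is lost entirely

print(overflow, not_a_number, unchanged)
print(np.finfo(np.float64).eps)               # spacing of floats just above 1
```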

There is another problem: notice that the tanh itself is applied repeatedly. You have $h_t = \tanh(W h_{t-1})$ and $h_{t+1} = \tanh(W h_t)$, so tanh is applied twice (written $\tanh^2$ here, meaning tanh of tanh); similarly you get $\tanh^3$, and so on.


$\tanh^2$ is flat, $\tanh^3$ will look even flatter

Compare tanh and $\tanh^2$: $\tanh^2$ looks flatter. If you take $\tanh^{100}$, it looks flatter still. In all these cases the curves become flatter and flatter, so the gradients become very small.
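The flattening can be measured directly. In this sketch `tanh_n` applies tanh n times in a row (the composition that the notes write as a power of tanh), and `slope` estimates its derivative at a point with a central finite difference. Both helper names and the evaluation point x = 1 are choices made for the illustration; the weight W is left out, as in the heuristic argument.

```python
import numpy as np

def tanh_n(x, n):
    """Apply tanh n times: tanh(tanh(...tanh(x)))."""
    for _ in range(n):
        x = np.tanh(x)
    return x

def slope(n, x=1.0, eps=1e-5):
    """Finite-difference estimate of d/dx tanh_n(x, n)."""
    return (tanh_n(x + eps, n) - tanh_n(x - eps, n)) / (2 * eps)

for n in (1, 2, 10, 100):
    print(n, slope(n))      # the slope shrinks as n grows
```

Each extra tanh multiplies the slope by a factor below 1 (away from x = 0), so the gradient through many repeated applications keeps shrinking.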

All these problems put together lead to these 2 issues. The repeated tanh leads only to the vanishing gradient issue, but a large number of layers can lead to either exploding or vanishing gradients. Both make training very difficult.

⚛️ Gradient clipping for exploding gradients

Answer: It is very simple: we decide on a maximum allowable gradient size. What do I mean by the size of a gradient? The gradient is a vector, so it has no single value; you can, however, assign a value to the norm of the gradient.

  • Say-

$$\Huge{\color{Purple} \begin{align*} & \frac{\partial \textrm{L}}{\partial \textrm{W}} = \vec{g} & max \parallel \vec{g} \parallel = G_{max} \\ \end{align*} } $$

Calculate $\large{\color{Purple}\vec{g}}$ as usual for gradient descent, then:

$$\Huge{\color{Purple} \begin{align*} & \textit{If } \parallel \vec{g} \parallel < G_{max} \large \textrm{[ Proceed as usual]} \\ & \textit{If not, } \vec{g^*} = \frac{\vec{g}}{\parallel \vec{g} \parallel} G_{max} \\ \end{align*} } $$

  • The new gradient $\vec{g^*}$ points in the same direction as the gradient you calculated, but its size is cut down to $G_{max}$.
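The clipping rule above translates almost line for line into NumPy. The function name `clip_by_norm` is chosen for this sketch and is not tied to any particular library.

```python
import numpy as np

def clip_by_norm(g, g_max):
    """Rescale g so that its 2-norm is at most g_max, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm <= g_max:
        return g                       # proceed as usual
    return g * (g_max / norm)          # same direction, norm equal to g_max

g = np.array([3.0, 4.0])               # norm 5
clipped = clip_by_norm(g, 1.0)
print(clipped, np.linalg.norm(clipped))
```

Scaling the whole vector, rather than clipping each component separately, is what keeps the direction of the update unchanged.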

⚛️ Solution for Vanishing gradients

Answer: Unfortunately no such simple solution exists. Alternative architectures such as LSTM and GRU were developed as a result.

⚛️ Solution for expensive gradient Computation Truncated Back Propagation Through Time(TBPTT)

Answer: This solution kind of handles to a certain extent, both the vanishing gradient and the exploding gradient problems.

So, suppose we have data with thousands and thousands of time steps, and you want to do back propagation through time. How would you do it?

Step #1

Forward propagate through the whole thing and calculate the whole of $L$.

Step #2

Then you will back propagate through the whole thing.

Now suppose we have 65,000 time steps. If you go forward through the full sequence and come back through the full sequence, almost any correction you compute will suffer from vanishing or exploding gradients; plus it would be potentially very expensive to do just one gradient update.

So the solution to that is truncated back propagation through time.

Throughout the RNN you have exactly the same W. So, instead of training on the whole sequence, you split it up into many mini-batches (similar to mini-batch gradient descent).

I forward propagate through the first 2 steps and back propagate through 2 steps; this is one possibility. Since W is the same everywhere, I get a new, updated W.

Next I forward propagate through another 2 steps and back propagate through 2 steps; W is updated again. Once W is updated, I continue forward propagating through the sequence, and I keep on doing this.

If you back propagate through only a small number of steps, the gradient is much less likely to blow up or vanish. What is a good rule of thumb? It is hard to say: for some problems a hundred steps work well, for others 10 or 20 steps are enough.

Deep RNNs



Deep RNN

The deep RNNs are particularly important in language processing especially in language translation.

$${\color{Purple} \begin{align*} & \huge h_t^{(l)} = tanh \Big( W^{(l)}h_{t-1}^{(l)}+U^{(l)}h_{t}^{(l-1)} \Big )\\ \end{align*} } $$

So what is a deep RNN? In a normal RNN, each time step has one input layer, one hidden layer and one output layer, just like a small ANN. In a deep RNN, that single hidden layer is replaced by a stack of hidden layers, so each time step becomes a deep neural network: layer $l$ receives its own previous state $h_{t-1}^{(l)}$ and the current state of the layer below, $h_t^{(l-1)}$. That is the only difference between a deep RNN and a normal RNN.
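One time step of the deep RNN formula above can be sketched as a loop over layers. The sketch assumes that the input to the first layer is $x_t$ (that is, $h_t^{(0)} = x_t$), omits biases as the formula does, and uses hypothetical names and sizes throughout.

```python
import numpy as np

def deep_rnn_step(x_t, h_prev, Ws, Us):
    """Advance every layer by one time step.

    h_prev[l] is the state of layer l at time t-1; the return value holds
    the states of all layers at time t.
    """
    layer_input = x_t                  # h_t^{(0)} is taken to be x_t (assumed)
    h_new = []
    for W, U, h in zip(Ws, Us, h_prev):
        h_l = np.tanh(W @ h + U @ layer_input)   # h_t^{(l)}
        h_new.append(h_l)
        layer_input = h_l              # layer l feeds layer l+1 at the same step
    return h_new

rng = np.random.default_rng(0)
n_x, n_h, n_layers = 3, 4, 2           # hypothetical sizes
Ws = [rng.normal(size=(n_h, n_h)) * 0.1 for _ in range(n_layers)]
Us = [rng.normal(size=(n_h, n_x)) * 0.1] + [
    rng.normal(size=(n_h, n_h)) * 0.1 for _ in range(n_layers - 1)
]
h = [np.zeros(n_h) for _ in range(n_layers)]
for x_t in rng.normal(size=(5, n_x)):  # a sequence of 5 time steps
    h = deep_rnn_step(x_t, h, Ws, Us)
print([h_l.shape for h_l in h])
```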


Different types of RNN based on Input and Output Sequences



Many-to-Many (top left), Many-to-One (top right), One-to-Many (bottom left), and Encoder–Decoder (bottom right) networks

Many-to-Many(Sequence-to-Sequence) Network

  • An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs.
  • Example #1: Language Translation, Speech Recognition
    • The output does not come simultaneously with the input, and the size of the output need not be the same as that of the input.
  • Example #2: Video frame-by-frame analysis
    • The output size is fixed by the input size.

Many-to-One (Sequence-to-Vector) Network

  • You could feed the network a sequence of inputs and ignore all outputs except for the last one.
  • For example: Sentiment Analysis- you could feed the network a sequence of words corresponding to a movie review and the network would output a sentiment score.

One-to-Many(Vector-to-Sequence) Network

  • Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence.
  • For example: Image Captioning- the input could be an image (or the output of a CNN), and the output could be a caption(text) for that image.

Encoder-Decoder Network

  • Lastly, you could have a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder.
  • For example, this could be used for translating a sentence from one language to another.
    • You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language.
    • This two-step model, called an Encoder–Decoder, works much better than trying to translate on the fly with a single sequence-to-sequence RNN (like the one represented at the top left): the last words of a sentence can affect the first words of the translation, so you need to wait until you have seen the whole sentence before translating it.


Bibliography


  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition by Aurélien Géron
  • NPTEL(Dr. Balaji Srinivasan)
  • YouTube 1