A SUBTLE Introduction to LSTM

Ashok Kumar
6 min read · Nov 12, 2020

When was the last time you started thinking about something completely from scratch? Honestly, we rarely do, because we always think about something in relation to something else. As we read, we understand each sentence with respect to the surrounding words. You do not forget everything and start thinking from scratch again; your thoughts have persistence. Traditional neural networks can’t do this. It is hard to imagine how a neural network would classify a statement or a sentence when it cannot reason with the other statements in a paragraph.

RNNs, or Recurrent Neural Networks, are built to address this issue. They loop information back into themselves in order to hold on to it; in other words, they behave like multiple copies of the same network, each passing processed data to the next, which in turn processes that data and sends it forward. Because of this chain-like behavior, RNNs are closely related to lists and sequences, and they can be used for the example mentioned earlier as well as many more use cases such as speech recognition, language modeling, translation, image captioning, and so on.

One of the major appeals of RNNs is their ability to connect previous information to the present task, such as using previous video frames to understand the current frame, but this does not work all the time. For example, if we are trying to predict an object’s movement in a video frame with respect to the 5 previous frames, it is easy. However, when you try to predict the object’s movement with respect to some other characteristic from many frames back, say 20–30, the result would not be that accurate, or there would not be a result at all. In other words, as the gap grows between the current task and the relevant earlier information, RNNs become less usable. In theory, RNNs can handle such “long-term dependencies” if a human carefully picks parameters for them to solve problems of this form. Sadly, in practice, RNNs do not seem to be able to learn them. Hence, we use LSTMs to work around this issue, as they do not seem to be affected by it.

The LSTM NETWORK

LSTM, or Long Short-Term Memory, is a form of RNN capable of learning long-term dependencies. Introduced in 1997, LSTMs have been developed and refined rigorously ever since, and they work well on a large variety of problems.

LSTM Model

As you can see, an LSTM has a lot going on inside it, and it is because of these small internal functions that LSTMs are so handy for long-term dependencies. The layer shown above is one of many LSTM layers linked together in the form of a chain.

LSTM CHAIN

However, as this view suggests, the internals have much more going on inside. Let us go through each of these functions one by one.

The Basic Connectors

In the LSTM CHAIN diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles are pointwise operations, like vector addition and multiplication, and the yellow boxes are learned neural network layers. The more prominent blue and purple circles represent the input and output.

LSTM WORKING

LSTMs have a unique component called the cell state; it is just a horizontal line that runs through the LSTM block, with only some minor interactions, making it easy for information to flow through unchanged. The data flowing along it can also be changed, with the changes carefully regulated by structures called gates.

cell state

As the name suggests, gates are a way to let data pass through conditionally. They are built from a sigmoid layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through: an output of one means “let everything through,” while zero means “let nothing through.”

GATE
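To make the gate idea concrete, here is a minimal NumPy sketch (the numbers are made up purely for illustration): the sigmoid produces values between zero and one, and a pointwise multiplication uses them to decide how much of each component of the data gets through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The gate's sigmoid layer produces one value in (0, 1) per component.
gate = sigmoid(np.array([-4.0, 0.0, 4.0]))   # ~[0.02, 0.50, 0.98]

# Pointwise multiplication lets each component through in proportion
# to its gate value: almost blocked, halved, almost fully passed.
data = np.array([10.0, 10.0, 10.0])
print(gate * data)                           # ~[0.18, 5.00, 9.82]
```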

Now, an LSTM has multiple gates to regulate data flow. Let us go through them as the data flows from the input to the output.

FORGET GATE

The first step in the LSTM block is to decide which data to forget and which data to keep in the cell state. The decision is made with the help of a forget gate. It looks at the output from the previous LSTM block and the new input, passes the concatenated data into a sigmoid layer, and then applies the result to the cell state with a pointwise multiplication operator. For each number in the cell state, the gate outputs a value between one, meaning keep the data entirely, and zero, meaning forget it completely.

For example, suppose we are using an LSTM to predict the next word of the present sentence based on the gender of the subject. When we move on to a different sentence, the LSTM can use this forget gate to forget the previous subject’s gender.

Forget gate equation

Ft = forget gate output

Ht-1 = output of prev LSTM block

Xt = new input
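In code, the forget gate is just a sigmoid layer applied to the concatenation of Ht-1 and Xt, followed by a pointwise multiplication with the old cell state. The sketch below is illustrative only: the weight matrix W_f, the bias b_f, and the vector sizes are made-up stand-ins for parameters a real network would learn.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative (untrained) parameters of the forget gate.
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # Ht-1: output of the previous block
x_t    = rng.standard_normal(input_size)    # Xt: new input
c_prev = rng.standard_normal(hidden_size)   # Ct-1: previous cell state

# Ft = sigmoid(W_f . [Ht-1, Xt] + b_f): one value in (0, 1) per cell-state entry.
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

# Pointwise multiplication scales down (forgets) parts of the old cell state.
print(f_t * c_prev)
```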

INPUT GATE

The next step is to decide which new information we will store in the cell state. This process has two parts: an “input gate layer” and a “tanh layer”. The input gate layer, basically a sigmoid layer, decides which values to update. Simultaneously, the tanh layer creates a vector of new candidate values that could be added to the state. We then combine these two to create an update to the state, which is applied to the cell state with the help of a pointwise addition operator.

Let’s take our previous example: since we have forgotten the previous subject’s gender, we would now like to store the new one, and for that we use the input gate.

Input gate equation
Cell state update merge

It = input gate output

Ht-1 = output of prev LSTM block

Xt = new input

C̃t = vector output of the tanh layer

Ct = cell state

Ft = forget gate output

Ct-1 = cell state of prev LSTM
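Here is a matching sketch of the input gate and the cell state update. As before, W_i, b_i, W_c, b_c and the vector sizes are illustrative stand-ins for learned parameters, and f_t is a made-up stand-in for the forget gate output Ft from the previous step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative (untrained) parameters of the input gate and the tanh layer.
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))
b_c = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # Ht-1: output of the previous block
x_t    = rng.standard_normal(input_size)    # Xt: new input
c_prev = rng.standard_normal(hidden_size)   # Ct-1: previous cell state
f_t    = rng.uniform(size=hidden_size)      # Ft: forget gate output, values in (0, 1)

z = np.concatenate([h_prev, x_t])

# It = sigmoid(W_i . [Ht-1, Xt] + b_i): which cell-state entries to update.
i_t = sigmoid(W_i @ z + b_i)

# C~t = tanh(W_c . [Ht-1, Xt] + b_c): candidate values to add to the state.
c_tilde = np.tanh(W_c @ z + b_c)

# Ct = Ft * Ct-1 + It * C~t: forget part of the old state, add the new candidates.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```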

OUTPUT GATE

Finally, we need to decide what the LSTM block is going to output. This output is based on the cell state, but in a filtered form. First, we run a sigmoid layer on the concatenated data of the previous LSTM block’s output and the new input; we then put the cell state through a tanh layer, which pushes its values to be between -1 and 1. We then multiply the two outputs, giving the final output. This output is later used by the next LSTM block, helping it process its data further and give a more precise output.

Looking at our previous example, we just stored our new subject via the input gate. We now have to decide the wording of the sentence relevant to the new subject.

Output gate

Ot = output gate output

Ht-1 = output of prev LSTM block

Xt = new input

tanh(Ct) = cell state processed by the tanh layer

Ht = output
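And a sketch of the output gate, again with made-up stand-in parameters W_o and b_o and illustrative vector sizes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Illustrative (untrained) parameters of the output gate.
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)   # Ht-1: output of the previous block
x_t    = rng.standard_normal(input_size)    # Xt: new input
c_t    = rng.standard_normal(hidden_size)   # Ct: updated cell state

# Ot = sigmoid(W_o . [Ht-1, Xt] + b_o): how much of each cell-state entry to emit.
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

# Ht = Ot * tanh(Ct): the block's output, passed on to the next LSTM block.
h_t = o_t * np.tanh(c_t)
print(h_t)
```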

This is how an LSTM works. The whole process described above happens in one block, which runs in a chain connected to many more blocks. Over the years, the LSTM has gone through many changes, some major and some minor, each suited to a specific task. Some of the major variants are the peephole LSTM, the peephole convolutional LSTM, the multiplicative LSTM, LSTMs with attention, and many more. LSTMs with attention are so powerful that they were used to power Google Translate and are still used to this day.
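Putting the pieces together, the whole block described above boils down to a short function, and the chain is simply that function applied to each element of a sequence while the output and cell state are carried forward. The sketch below again uses random, untrained weights and illustrative names (lstm_step, make_params, the sizes, the five-step sequence); in practice you would use a library implementation such as the LSTM layers in PyTorch or TensorFlow rather than writing the block by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM block: forget gate, input gate, cell state update, output gate."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])       # [Ht-1, Xt]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    c_tilde = np.tanh(W_c @ z + b_c)        # candidate values
    c_t = f_t * c_prev + i_t * c_tilde      # new cell state
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # block output
    return h_t, c_t

def make_params(hidden_size, input_size, rng):
    """Random (untrained) weights and biases for the four layers of one block."""
    def layer():
        W = rng.standard_normal((hidden_size, hidden_size + input_size))
        b = np.zeros(hidden_size)
        return W, b
    (W_f, b_f), (W_i, b_i), (W_c, b_c), (W_o, b_o) = layer(), layer(), layer(), layer()
    return W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = make_params(hidden_size, input_size, rng)

# The chain: run the same block over a sequence, carrying Ht and Ct forward.
h_t = np.zeros(hidden_size)
c_t = np.zeros(hidden_size)
sequence = rng.standard_normal((5, input_size))  # five time steps of made-up input
for x_t in sequence:
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
print(h_t)
```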

Conclusion

LSTMs are a remarkable family of algorithms and have made it possible to work on many problems. They are a big step forward in the field of RNNs. All in all, they are a very good choice for problems that require the algorithm to remember older data.
