Making RNNs Actually Work: LSTMs, Bidirectionality, and the Encoder-Decoder

A developer has detailed how stacked and bidirectional RNNs improve upon simple recurrent networks, explaining that stacking layers creates hierarchical representations while bidirectionality provides each word with both left and right context. The post demonstrates how a bidirectional RNN solves the problem of ambiguous sentiment, such as in the phrase "terribly exciting," by concatenating forward and backward hidden states.

Last post ended with a simple RNN and three promises: LSTMs, bidirectional RNNs, and attention. This post delivers the first two, plus the refinements that turn a working-on-paper RNN into something you'd actually deploy. By the end, you'll know how to stack RNNs for depth, why reading a sentence backward as well as forward (bidirectionality) makes representations sharper, how the encoder-decoder turns one sequence into a different one for machine translation, and exactly what breaks in a simple RNN that LSTMs and GRUs were invented to fix. Attention, the fix for the last problem we'll hit, gets its own home in the transformer post, so we'll stop right at the edge of it.

The first refinement is the easy one. Nothing says an RNN's output has to go straight to a prediction. You can feed the entire output sequence of one RNN as the input sequence to another. Then another. These are stacked RNNs (also called deep RNNs), and they usually outperform a single layer.

Why does depth help? The same reason it helps in vision. Each layer learns representations at a different level of abstraction. The lower layers pick up fundamental, local properties; in language, that's roughly the level of parts of speech and named entities. The higher layers compose those into bigger groupings: descriptive phrases, "this is the answer to a question," and so on. We can't point at layer 3 and say "this one does coreference." But the theory holds up well enough that researchers have probed models like BERT and found specific layers that seem to specialize in things like coreference resolution. The intuition is real even when the boundaries are fuzzy.

How many layers should you stack? It depends. More layers cost more to train, and they need more data to be worth it. If you don't have much training data, a deep stack may hurt rather than help. And in encoder-decoder setups, the two sides don't even have to match; it's common for the encoder to use 2-4 layers while the decoder uses more.
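To make the stacking idea concrete, here is a minimal pure-Python sketch with random, untrained weights. The helper names (`make_layer`, `run_layer`, `run_stack`) and the layer sizes are assumptions chosen for illustration, not anything prescribed by this post; the only point is that each layer consumes the full hidden-state sequence of the layer below.

```python
import math
import random

def make_layer(input_size, hidden_size, rng):
    """Random weights for one vanilla RNN layer (illustrative, untrained)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W": matrix(hidden_size, input_size),   # input -> hidden
            "U": matrix(hidden_size, hidden_size),  # hidden -> hidden
            "size": hidden_size}

def run_layer(layer, inputs):
    """Run one RNN layer over a sequence; return the hidden state at every step."""
    h = [0.0] * layer["size"]
    outputs = []
    for x in inputs:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_row, x)) +
                       sum(u * hi for u, hi in zip(u_row, h)))
             for w_row, u_row in zip(layer["W"], layer["U"])]
        outputs.append(h)
    return outputs

def run_stack(layers, inputs):
    """Feed the whole output sequence of each layer into the next one."""
    sequence = inputs
    for layer in layers:
        sequence = run_layer(layer, sequence)
    return sequence

rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 tokens, dim 4
layers = [make_layer(4, 8, rng), make_layer(8, 8, rng), make_layer(8, 6, rng)]
top = run_stack(layers, embeddings)
print(len(top), len(top[0]))  # 5 6: one vector per token, sized by the top layer
```

The sequence length never changes as you go up the stack; only the width of each position's vector does, and that is set by the layer that produced it.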
At the top of the stack, you typically put a feedforward network that produces whatever you actually need: a softmax over the vocabulary, a sigmoid for binary classification, a distribution over tags. The RNN stack builds the representation; the feedforward head reads it out.

Here's a problem the simple RNN has that's easy to miss. Take sentiment analysis on this sentence:

The movie was terribly exciting.

Walk an RNN left to right. When it reaches "terribly", the hidden state at that point is a representation of the word "terribly" in this context, but only the context to its left. Everything from the start of the sentence up to "terribly", and nothing after.

That's a problem here, because "terribly" almost always carries a negative connotation. On its own, the left context says "this review is going negative." But the very next word is "exciting", which flips the meaning completely. "Terribly exciting" is positive. The RNN, reading left to right, doesn't have "exciting" available when it builds the representation of "terribly".

The fix: read the sentence in both directions. A bidirectional RNN runs two separate RNNs over the same input. One goes left to right (forward). One goes right to left (backward), so the first word the backward RNN sees is the final token, then the second-to-last, and so on. At each position, you concatenate the forward and backward hidden states into a single representation.

Now the representation of "terribly" sees both its left context ("the movie was") and its right context ("exciting"). For a subjective task like sentiment, where any extra context helps, this is a real improvement. It costs more computation, since you're running two RNNs, but the accuracy usually justifies it.

The forward and backward RNNs generally have separate weights, and the "RNN" in each direction can be anything: a vanilla RNN, an LSTM, even a transformer. Bidirectionality is a wrapper, not a specific cell.
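The wrapper idea can be sketched in a few lines. This is a minimal pure-Python illustration with random, untrained weights; the function names and the tiny sizes are assumptions made for the example. It runs one RNN forward, a second one over the reversed input, re-aligns the backward states, and concatenates. The final lines swap the last word's embedding to show which half of the "terribly" representation can see it.

```python
import math
import random

def make_rnn(input_size, hidden_size, rng):
    """Random (W, U) weights for one vanilla RNN (illustrative, untrained)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return matrix(hidden_size, input_size), matrix(hidden_size, hidden_size)

def run_rnn(weights, inputs):
    """Return the hidden state after each input, in the order the inputs were read."""
    W, U = weights
    h = [0.0] * len(W)
    states = []
    for x in inputs:
        h = [math.tanh(sum(w * xi for w, xi in zip(w_row, x)) +
                       sum(u * hi for u, hi in zip(u_row, h)))
             for w_row, u_row in zip(W, U)]
        states.append(h)
    return states

def run_bidirectional(fwd_weights, bwd_weights, inputs):
    forward = run_rnn(fwd_weights, inputs)
    # The backward RNN reads the last token first; reverse its outputs again
    # so that backward[i] lines up with position i of the sentence.
    backward = run_rnn(bwd_weights, inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]  # concatenate per position

rng = random.Random(0)
tokens = ["the", "movie", "was", "terribly", "exciting"]
embed = {t: [rng.uniform(-1, 1) for _ in range(4)] for t in tokens}
fwd, bwd = make_rnn(4, 3, rng), make_rnn(4, 3, rng)  # separate weights per direction
states = run_bidirectional(fwd, bwd, [embed[t] for t in tokens])

# Replace the embedding of the final word and rerun.
embed_alt = dict(embed, exciting=[rng.uniform(-1, 1) for _ in range(4)])
alt_states = run_bidirectional(fwd, bwd, [embed_alt[t] for t in tokens])

i = tokens.index("terribly")
print(states[i][:3] == alt_states[i][:3])  # True: forward half never saw the last word
print(states[i][3:] == alt_states[i][3:])  # False: backward half depends on it
```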
This is also a preview of BERT (Bidirectional Encoder Representations from Transformers), which we'll meet later. BERT produces excellent contextual word representations, and the "B" is exactly this idea, bidirectionality, scaled up with transformers instead of RNNs. BERT is an encoder: it doesn't generate text, it builds a representation of the input that you then use for some downstream task.

Bidirectional RNNs have a second, less obvious use: representing words from their characters. For some languages, the vocabulary is simply too large to represent as a fixed embedding table. Hindi is a good example; it has rich morphology, where a base word can carry a long train of ligatures that mark singular/plural, case, possession, and more. A lot of meaning is packed into that morphology, and it blows up the vocabulary size.

The fix: build a word's embedding from a bidirectional RNN run over its character sequence. Take the word "Janet". Run an LSTM left to right over J-a-n-e-t, run another right to left, and concatenate. That concatenation becomes a character-level word embedding, one that captures prefixes, suffixes, and spelling patterns even for words the model has never seen as a whole. These are especially useful for morphologically complex languages, and also for messy domains like social media (think repeated letters, "lol", exclamation points, all character-level signals).

The common move is to combine both: concatenate the traditional word embedding with the character-derived embedding, and feed that joint vector into the RNN. You get the best of both: the lexical meaning from the word embedding, and the morphological/spelling signal from the characters.

Let's step back and lay out the four architectural patterns side by side. They look similar in a diagram, but they do very different things, and knowing which one a task calls for is half the battle.

1. Sequence labeling: Input $x_1, \ldots, x_n$, output $y_1, \ldots, y_n$, same length, one label per token. Part-of-speech tagging, NER.
2. Sequence classification: Input $x_1, \ldots, x_n$, output a single label. Sentiment analysis, spam detection.
3. Language modeling: Input $x_1, \ldots, x_{t-1}$, output the next token $x_t$, which then becomes the input for predicting $x_{t+1}$. Autoregressive generation.
4. Encoder-decoder: Input $x_1, \ldots, x_n$ compressed into a context representation, which a decoder expands into output $y_1, \ldots, y_m$, and crucially, the output length $m$ need not equal the input length $n$.

The last point matters. Machine translation rarely has a word-to-word correspondence; the target sentence might be longer, shorter, or reordered. That's why translation needs the fourth pattern and not the first.

And the punchline: the "RNN" in any of these can be swapped for an LSTM or a transformer. The architecture is the recipe; the cell is an ingredient you can substitute.

The fourth pattern deserves a closer look, because it's where the next big problem shows up. Take machine translation. English in, Spanish out:

the green witch arrived → llegó la bruja verde

The encoder reads the source sentence with an RNN, carrying its hidden state forward as usual. After the last source word, you add a separator token to mark the boundary between source and target. The encoder's final hidden state is, in theory, a good encoding of the entire source sentence. This is the context vector, $c$.

The decoder takes that context and generates the translation autoregressively, one word at a time, each new word conditioned on the previous one. It produces "llegó", feeds that back in to produce "la", and so on. In equations, with superscripts $e$ and $d$ marking encoder and decoder states, $g$ standing for the decoder's recurrent cell, and $\hat{y}_{t-1}$ for the previously generated word:

$$c = h^e_n$$

$$h^d_0 = c$$

$$h^d_t = g(\hat{y}_{t-1}, h^d_{t-1}, c)$$

Look closely at that third equation.
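As a concrete reading of the decoder just described, here is a minimal pure-Python sketch with random, untrained weights, so the "translation" it prints is meaningless. The names `encode`, `decode`, the extra matrix `C_d`, and the `<s>`/`</s>` tokens are assumptions made for illustration. What matters is where the context vector appears: once as the decoder's starting state, and again as an input inside the loop at every step.

```python
import math
import random

def matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]

def encode(params, source):
    """Run the encoder RNN; its final hidden state is the context vector."""
    h = [0.0] * len(params["U_e"])
    for x in source:
        h = [math.tanh(v) for v in add(matvec(params["W_e"], x),
                                       matvec(params["U_e"], h))]
    return h

def decode(params, c, embed, vocab, max_len):
    """Greedy decoding: c is the initial state AND an input at every step."""
    h = c                    # the decoder starts from the context vector
    prev = embed["<s>"]      # separator token opens the target side
    output = []
    for _ in range(max_len):
        h = [math.tanh(v) for v in add(matvec(params["W_d"], prev),  # previous output
                                       matvec(params["U_d"], h),     # previous hidden state
                                       matvec(params["C_d"], c))]    # context, every step
        scores = matvec(params["V"], h)
        word = vocab[scores.index(max(scores))]  # argmax; a softmax would pick the same word
        if word == "</s>":
            break
        output.append(word)
        prev = embed[word]
    return output

rng = random.Random(0)
D, H = 4, 6  # embedding size and hidden size (arbitrary)
source_words = ["the", "green", "witch", "arrived"]
vocab = ["llegó", "la", "bruja", "verde", "</s>"]
embed = {w: [rng.uniform(-1, 1) for _ in range(D)]
         for w in source_words + vocab + ["<s>"]}
params = {"W_e": matrix(H, D, rng), "U_e": matrix(H, H, rng),
          "W_d": matrix(H, D, rng), "U_d": matrix(H, H, rng),
          "C_d": matrix(H, H, rng), "V": matrix(len(vocab), H, rng)}

c = encode(params, [embed[w] for w in source_words])
translation = decode(params, c, embed, vocab, max_len=6)
print(len(c), translation)
```

Note that the output length is decided by the decoder (it stops at `</s>` or `max_len`), not by the length of the source.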
The context vector $c$ doesn't just initialize the decoder; it's fed into every decoder timestep, alongside the previous output and previous hidden state.

The encoder-decoder works. But it has a structural weakness, and seeing it clearly is the whole reason attention exists. Everything the decoder knows about the source has to pass through one fixed-size vector, $c$, the encoder's final hidden state. For "the green witch arrived", that's fine. For a long sentence, it's a problem on two fronts.

First, a single vector struggles to hold the full meaning of a long sentence and keep track of the individual words and their relationships. Translation needs both, meaning and word-level correspondence, and one vector is a tight squeeze.

Second, we already know simple RNNs lose information from early in the sequence. By the time the encoder reaches the last source word, the contribution of the first word to that final hidden state may be faint. So the context vector is biased toward the end of the source sentence.

Consider this sentence, to make the distance problem vivid:

Natalie took a train to O'Hare and then a plane to LA, and then a plane to Tokyo, and then a plane to Miyazaki, where she finally Ubered to her hotel.

To resolve "she", and then "her", back to "Natalie", the model has to carry that reference across the entire sentence. Look at the distance. A simple RNN's hidden state, even though it theoretically reaches back to the start, leaks too much over that span to do this reliably.

This is the bottleneck, and the eventual fix is attention: letting the decoder look back at all the encoder's hidden states, not just the final one. But attention is the heart of the transformer, so we'll pick it up there. For now, just hold the open question: how do we keep long-distance information from washing out?

The bottleneck is about inference. There's a related problem in training, and it has a name we met briefly last time: the vanishing gradient.
When you backpropagate the error through a long sequence, you multiply gradients together at every step. If those gradients are small, the product shrinks toward zero. The signal from far back in the sequence becomes so faint that the model can't learn from it. And here's the subtle part: when that signal vanishes, you can't tell why. Either there is no real dependency between the distant step and the current one in the data, or there is one and the model's parameters are simply failing to capture it. The training procedure can't distinguish those two cases, which means real long-distance dependencies quietly go unlearned.

There's a clean linguistic illustration. Take this language-modeling prompt:

The writer of the books ___

Should it be "is" or "are"? The correct answer is "is"; it agrees with "writer", the head of the phrase. This is syntactic recency: the grammatically relevant word is "writer", several tokens back. But the word immediately before the blank is "books", a plural noun. A model biased toward the nearest word (sequential recency) gets tempted into "are".

And it turns out simple RNNs are much better at sequential recency than syntactic recency. They reach for the nearby plural and get the agreement wrong. It's the same mistake people make when speaking quickly.

The solution isn't a trick. It's a better cell. Which brings us to LSTMs.

This is the big one: the fix for the problem that's been lurking under every section so far. The thread of hidden state frays on long sequences. The bottleneck squeezes. The gradient vanishes. All three are the same underlying failure: a single hidden state asked to carry everything, decaying as it goes.

The Long Short-Term Memory network is built around one idea: instead of forcing a single hidden state to do everything, give the cell an explicit mechanism for deciding what to remember, what to forget, and what to expose.

Think about how your own memory works. Walking up to a building, you remember whether the door was on the left or the right. You do not remember how tall the grass was outside.
Your brain is ruthlessly efficient about discarding what won't matter later and holding onto what will. That's the behavior an LSTM tries to learn.

A simple RNN has one hidden state, $h_t$. An LSTM adds a second: the cell state, $c_t$. The cell state is the long-term memory, the conveyor belt of information that runs forward through time, and that the network can read from, write to, and erase. The hidden state is more like the working output at each step.

The reading, writing, and erasing are controlled by gates. Each gate is a sigmoid; it outputs values between 0 and 1, acting like a soft mask. Multiply something by a gate near 1, and it passes through; multiply by a gate near 0, and it's erased. The gates are dynamic: their values are computed fresh at every timestep from the current input and previous hidden state.

The forget gate decides what to keep versus drop from the previous cell state:

$$f_t = \sigma(W_f x_t + U_f h_{t-1})$$

The input gate decides which parts of the new candidate content get written into the cell:

$$i_t = \sigma(W_i x_t + U_i h_{t-1})$$

The candidate is the new information we might add, computed just like a vanilla RNN's hidden update:

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$$

The new cell state combines the two: forget some of the old, write some of the new. The $\odot$ is element-wise multiplication.

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The output gate decides what part of the cell to expose as this step's hidden state:

$$o_t = \sigma(W_o x_t + U_o h_{t-1})$$

$$h_t = o_t \odot \tanh(c_t)$$

Quick reference: the three LSTM gates

- Forget gate $f$: how much of the previous cell state to keep versus erase.
- Input gate $i$: how much of the new candidate content to write into the cell.
- Output gate $o$: how much of the cell to expose as this step's hidden state.

It looks like a lot of equations, but the shape is simple: compute three gates, use them to forget part of the old cell, write part of a new candidate, and read out a hidden state.

Every weight matrix in there is learned during training. Nobody hand-codes what to forget; the network figures out, for the task at hand, what's worth carrying forward.
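The "compute three gates, forget, write, read out" shape translates almost line for line into code. Below is a minimal pure-Python sketch of a single LSTM step. It assumes the common bias-free formulation (each gate is a sigmoid of an input matrix times $x_t$ plus a recurrent matrix times $h_{t-1}$); the function names and the parameter layout are illustrative choices, not a reference implementation.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, x, h):
    """Compute W·x + U·h, one value per hidden unit."""
    return [sum(w * xi for w, xi in zip(w_row, x)) +
            sum(u * hi for u, hi in zip(u_row, h))
            for w_row, u_row in zip(W, U)]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM timestep; params maps 'f', 'i', 'o', 'c' to a (W, U) pair."""
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]       # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]       # input gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]       # output gate
    cand = [math.tanh(v) for v in affine(*params["c"], x, h_prev)]  # candidate content
    # Forget part of the old cell, write part of the candidate.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, cand)]
    # Expose part of the (squashed) cell as this step's hidden state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def make_params(input_size, hidden_size, init):
    def matrix(rows, cols):
        return [[init() for _ in range(cols)] for _ in range(rows)]
    return {name: (matrix(hidden_size, input_size), matrix(hidden_size, hidden_size))
            for name in ("f", "i", "o", "c")}

# Sanity check: with all-zero weights every gate is exactly 0.5 and the
# candidate is 0, so the cell state is simply halved.
zero = make_params(3, 2, lambda: 0.0)
h0, c0 = lstm_step(zero, [1.0, 2.0, 3.0], [0.0, 0.0], [2.0, -2.0])
print(c0)  # [1.0, -1.0]

# Run a few steps with random (untrained) weights, carrying both states forward.
rng = random.Random(0)
params = make_params(3, 2, lambda: rng.uniform(-0.5, 0.5))
h, c = [0.0, 0.0], [0.0, 0.0]
for x in [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5], [0.9, 0.1, -0.2]]:
    h, c = lstm_step(params, x, h, c)
print(len(h), len(c))  # 2 2
```

The zero-weight check is a handy way to read the cell-state update: a forget gate of 0.5 keeps half of the old memory, and an input gate of 0.5 would write half of the candidate if there were one.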
A useful gut check: suppose you have a simple RNN with 32 units and an LSTM with 32 units. Which takes longer to train? The LSTM, and the reason is right there in the equations. It has three gates plus a candidate, which means roughly four sets of weight matrices to learn instead of one. More gates, more weights, more compute. The extra power isn't free.

LSTMs slot into all the patterns we've seen. A common one is the bidirectional LSTM for tagging: run word + character embeddings through a forward and backward LSTM, concatenate, and label each token. For the sentence "we are playing soccer", that produces pronoun / verb / verb / noun.

LSTMs shine wherever temporal structure genuinely matters: sentiment, translation, sequence tagging. They're useful well beyond NLP too, in things like time-series analysis. From roughly 2013 to 2015, they were the dominant approach. By around 2019, transformers had largely taken over. But the lineage runs straight through them.

The Gated Recurrent Unit is the LSTM's leaner sibling. Same motivation (gating to manage long-distance information) but simpler. The key difference: a GRU has no separate cell state. It folds everything into the hidden state, managed by two gates instead of three: an update gate and a reset gate.

The full GRU equations for the curious

Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1})$

Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1})$

Candidate: $\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$

Hidden state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The reset gate $r_t$ picks which parts of the old hidden state feed the candidate; the update gate $z_t$ blends old and new in one shot, the job that the forget and input gates do separately in an LSTM.

Because there's no cell state, there are fewer weight matrices to learn, which makes a GRU faster to train than an LSTM and easier on memory. It helps with the vanishing gradient problem as well.

The practical rule of thumb: start with an LSTM.
If training is too slow or you need more efficiency, switch to a GRU; you'll usually get comparable results with less cost.

The previous post and this one walked through a progression of recurrent units. Lined up, they tell a clean story, and the practical takeaways for building with them come down to eight things from this lecture:

1. Stacked RNNs add depth, with lower layers learning fundamental features (POS, NER) and higher layers learning semantic groupings. More layers cost more and need more data.
2. Bidirectional RNNs read the sequence forward and backward and concatenate, so each position sees both left and right context. Great for sentiment and tagging, but illegal for language modeling, where the right context doesn't exist yet.
3. Character-level bi-RNNs build word embeddings from character sequences, handling rich morphology (Hindi) and messy text. Often concatenated with traditional word embeddings.
4. Four NLP architectures: sequence labeling, sequence classification, language modeling, and encoder-decoder, the last one allowing output length to differ from input length. The recurrent cell inside any of them is swappable.
5. The encoder-decoder compresses a source sequence into a context vector $c$ and decodes a new sequence from it, feeding $c$ into every decoder step so the output stays anchored to the input rather than drifting.
6. The bottleneck problem: forcing everything through one fixed context vector loses long-distance information, especially for long sentences. This is the open door that attention (next post) walks through.
7. Syntactic vs. sequential recency: simple RNNs favor the nearest word over the grammatically correct one ("the writer of the books are"), a symptom of vanishing gradients on long-range dependencies.
8. LSTMs and GRUs: gated cells that learn what to remember and what to forget. LSTMs add an explicit cell state and three gates; GRUs drop the cell state for speed. Both tame the vanishing gradient far better than a vanilla RNN, at the cost of more weights to train.
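The "more weights to train" cost in the last takeaway can be put into rough numbers. The sketch below counts only the recurrent cell's weight matrices, ignoring biases and any output layer: one (input, recurrent) pair for a vanilla RNN, three pairs for a GRU (update, reset, candidate), four for an LSTM (forget, input, output, candidate). The hidden size of 32 is the example used earlier in this post; the input size of 64 is an arbitrary assumption.

```python
def weight_count(input_size, hidden_size, weight_sets):
    """Each set is an input matrix (hidden x input) plus a recurrent matrix (hidden x hidden)."""
    return weight_sets * hidden_size * (input_size + hidden_size)

INPUT_SIZE = 64   # assumed embedding size, chosen only for illustration
HIDDEN_SIZE = 32  # the 32-unit example from the text

counts = {
    "vanilla RNN": weight_count(INPUT_SIZE, HIDDEN_SIZE, 1),
    "GRU": weight_count(INPUT_SIZE, HIDDEN_SIZE, 3),
    "LSTM": weight_count(INPUT_SIZE, HIDDEN_SIZE, 4),
}
for name, n in counts.items():
    print(f"{name}: {n} weights")
# vanilla RNN: 3072 weights
# GRU: 9216 weights
# LSTM: 12288 weights
```

Under these assumptions the ratio is always 1 : 3 : 4 regardless of the sizes you pick, which is the arithmetic behind "start with an LSTM, drop to a GRU if you need the savings."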