
Vanishing Gradients & LSTMs

The Problem and Its Solution

The Vanishing Gradient Problem

Vanilla RNNs have a fundamental limitation that prevents them from learning long-range dependencies. Let's understand why this happens and see it in action.

Gradient Flow During Backpropagation

During backpropagation through time (BPTT), the gradient at step 0 requires multiplying many Jacobian matrices together:

Gradient through time

∂L/∂h_0 = (∂L/∂h_T) · ∏_{t=0}^{T-1} (∂h_{t+1}/∂h_t)

  • ∂L/∂h_0: gradient of the loss with respect to the initial hidden state
  • ∏_{t=0}^{T-1}: product over all time steps
  • ∂h_{t+1}/∂h_t: Jacobian of the hidden state transition

The Numbers Don't Lie

0.5^20 ≈ 0.00000095

Gradient essentially vanishes - no learning signal reaches early steps

1.0^20 = 1.0

Perfect preservation - the ideal case that LSTMs achieve

2.0^20 = 1,048,576

Gradient explodes - training becomes unstable with NaN values
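
You can reproduce these three regimes with a few lines of Python (a minimal sketch, nothing framework-specific):

for multiplier in (0.5, 1.0, 2.0):
    # Gradient magnitude remaining after 20 backward steps with this per-step multiplier.
    print(f"{multiplier}^20 = {multiplier ** 20:g}")

# Output:
# 0.5^20 = 9.53674e-07   -> vanishes
# 1.0^20 = 1             -> preserved
# 2.0^20 = 1.04858e+06   -> explodes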

Gradient Flow Through Time

Watch how gradients multiply as they flow backward through time steps. This is why vanilla RNNs struggle with long-range dependencies.

Each step multiplies by 0.5, causing exponential decay

(Interactive demo: a slider sets the number of time steps, from 1 to 20; with 10 steps selected, the chart shows the gradient magnitude at each step from t=0 to t=10.)

Gradient flowing from t=10 back to t=0: 1.0 × 0.5^10 ≈ 9.8e-4

The Vanishing Gradient Problem

When the gradient multiplier is less than 1, gradients shrink exponentially. After 10 steps, the gradient is only about 9.8e-4 of its original value, so the early time steps receive almost no learning signal, making it impossible to learn long-range dependencies.

Mathematical Insight

During backpropagation through time (BPTT), the gradient at step t=0 involves multiplying many Jacobian matrices: ∂h_T/∂h_0 = ∏_{t=0}^{T-1} (∂h_{t+1}/∂h_t). If the eigenvalues of these Jacobians are consistently smaller than 1 in magnitude, the gradient vanishes exponentially; if they are consistently larger than 1, it explodes.
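
To make the insight concrete, the sketch below assumes a simplified linear recurrence h_{t+1} = W·h_t, so that every Jacobian ∂h_{t+1}/∂h_t is just W and the largest eigenvalue magnitude of W (the values 0.5, 1.0, and 2.0 are illustrative) directly controls whether the accumulated gradient vanishes, holds steady, or explodes:

import numpy as np

def gradient_norm_through_time(spectral_radius, steps=20, hidden=16, seed=0):
    """Norm of dh_T/dh_0 for the simplified linear recurrence h_{t+1} = W h_t."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((hidden, hidden))
    # Rescale W so its largest eigenvalue magnitude equals spectral_radius.
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    grad = np.eye(hidden)            # dh_T/dh_T
    for _ in range(steps):
        grad = W @ grad              # accumulate the product of Jacobians
    return np.linalg.norm(grad)

for radius in (0.5, 1.0, 2.0):
    print(f"largest |eigenvalue| = {radius}: ||dh_20/dh_0|| ~ {gradient_norm_through_time(radius):.2e}")

A real tanh RNN also multiplies in the activation derivative at each step, which is at most 1, so it can only shrink the product further.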

Explain the Vanishing Gradient Problem

Dinner Party Version

Imagine playing a game of telephone where each person only whispers 50% of what they heard. By the time the message reaches the 20th person, it's basically silence. That's what happens to learning signals in regular RNNs - they fade to nothing before reaching early parts of the sequence.

The LSTM Solution

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber in 1997, solve the vanishing gradient problem with an elegant architectural change: the cell state.

LSTM Cell Architecture

Hover over gates to learn what each component does.

The LSTM Solution

Unlike vanilla RNNs where gradients multiply through transformations, the LSTM cell state C_t flows through with only element-wise operations (× and +). This creates a "gradient highway" where gradients can flow unchanged through many time steps, solving the vanishing gradient problem.

The Cell State Equation

The key innovation is the cell state C_t, which flows through time with only element-wise operations:

Cell State Update

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

  • C_t: cell state at time t, the long-term memory of the LSTM
  • f_t: forget gate output, values between 0 (forget) and 1 (keep)
  • C_{t-1}: previous cell state
  • i_t: input gate output, controls what new information to add
  • C̃_t: candidate cell state, new information that could be added
  • ⊙: element-wise (Hadamard) product

Why This Solves Vanishing Gradients

Notice the addition (+) in this equation. Unlike vanilla RNNs, where the hidden state is completely transformed at each step, the LSTM cell state is updated additively. When the forget gate is close to 1, the gradient flows through unchanged: ∂C_t/∂C_{t-1} = f_t ≈ 1, so the product of these factors across many time steps stays close to 1 instead of shrinking.
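
A tiny Python illustration of that point (the forget-gate value of 0.99 is made up for illustration): when f_t stays near 1, the end-to-end factor ∂C_20/∂C_0 is just the product of the forget gates and decays only slowly, while a vanilla-RNN-style multiplier of 0.5 wipes the signal out.

steps = 20
forget_gate = 0.99                       # illustrative gate value close to 1 ("mostly keep")

# dC_t/dC_{t-1} = f_t, so dC_20/dC_0 is the product of the forget gates.
print(f"LSTM cell state: {forget_gate}^{steps} = {forget_gate ** steps:.3f}")   # ~0.818
print(f"Vanilla RNN analogue: 0.5^{steps} = {0.5 ** steps:.2e}")                # ~9.5e-07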

The Three Gates

1. Forget Gate

Decides what information to discard from the cell state. A value of 0 means "completely forget", while 1 means "completely keep".

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

  • f_t: forget gate activation (0 to 1)
  • σ: sigmoid function
  • W_f: forget gate weight matrix
  • h_{t-1}: previous hidden state
  • x_t: current input
  • b_f: forget gate bias

2. Input Gate

Decides what new information to store. Works with a candidate value created by a tanh layer.

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

  • i_t: input gate activation (0 to 1)
  • C̃_t: candidate values (-1 to 1)

3. Output Gate

Decides what parts of the cell state to output. The output goes through tanh to squash values to [-1, 1].

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

  • o_t: output gate activation (0 to 1)
  • h_t: hidden state (output)
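
Putting the three gates together, here is a self-contained numpy sketch of a single LSTM step, following the equations above; the weight shapes, the 0.1 initialization scale, and the five-step loop are illustrative assumptions, not a reference implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, input, and output gates plus the cell-state update."""
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate: what to discard
    i_t = sigmoid(W_i @ z + b_i)             # input gate: what new information to store
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate: what to expose
    h_t = o_t * np.tanh(c_t)                 # new hidden state
    return h_t, c_t

# Illustrative sizes: 4-dimensional input, 8-dimensional hidden and cell state.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = []
for _ in range(4):                           # forget, input, candidate, output
    params += [0.1 * rng.standard_normal((n_hid, n_hid + n_in)), np.zeros(n_hid)]

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):     # run five time steps
    h, c = lstm_step(x, h, c, params)
print(h.shape, c.shape)                      # (8,) (8,)

In practice you would reach for a library implementation such as torch.nn.LSTM, but writing the step out once makes the role of each gate explicit.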

Explain LSTMs to Your Stakeholders

Dinner Party Version

An LSTM is like a smart sticky note system. It has three decision points: (1) Should I erase what's on the note? (2) Should I write something new? (3) What should I say out loud based on the note? This lets it remember important things for a very long time while forgetting irrelevant details.

Why Long-Range Dependencies Matter

Let's look at concrete examples where vanilla RNNs fail and LSTMs succeed. In each case, the model must remember information across many tokens.

Long-Range Dependencies in Practice

These examples show why RNNs need long-term memory. The highlighted words must be connected across many intervening tokens.

📝

Simple Agreement

Distance: 1 word
The cat sits.
Dependency: cat → sits (singular)
Challenge: Easy - minimal distance to track
📝

Complex Agreement

Distance: 12 words
The cat, which my sister bought from the shelter last summer, sits.
Dependency: cat (12 words ago) → sits (singular)
Challenge: Hard - must remember "cat" is singular across many words

Why LSTMs Excel Here

Vanilla RNNs struggle with these examples because the gradient signal from the dependent word (e.g., "sits") must travel back through many time steps to reach the source (e.g., "cat"). LSTMs solve this with their cell state, which can carry information across arbitrary distances with minimal degradation.

Key Takeaways

❌ Vanilla RNN Limitations

  • Gradients multiply through time steps
  • Exponential decay (vanishing) or growth (exploding)
  • Cannot learn dependencies beyond ~10-20 steps
  • No explicit memory mechanism

✓ LSTM Advantages

  • Additive cell state updates
  • Gradient highway through the forget gate
  • Can learn dependencies across hundreds of steps
  • Explicit forget/remember mechanisms

Test Your Understanding

Take this quiz to reinforce your understanding of vanishing gradients and LSTMs. You need 70% to pass, and you can retake it as many times as you like.

Module 3 Quiz: Vanishing Gradients & LSTMs

Test your understanding of the vanishing gradient problem and how LSTMs solve it.

10 questions
70% to pass