Vanishing Gradients & LSTMs
The Problem and Its Solution
The Vanishing Gradient Problem
Vanilla RNNs have a fundamental limitation that prevents them from learning long-range dependencies. Let's understand why this happens and see it in action.
Gradient Flow During Backpropagation
During backpropagation through time (BPTT), the gradient at step 0 requires multiplying many Jacobian matrices together:
∂L/∂h_0 = (∂L/∂h_T) · ∏_{t=0}^{T-1} ∂h_{t+1}/∂h_t

| Symbol | Meaning |
|---|---|
| ∂L/∂h_0 | Gradient of the loss with respect to the initial hidden state |
| ∏_{t=0}^{T-1} | Product over all time steps |
| ∂h_{t+1}/∂h_t | Jacobian of the hidden state transition |
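To see this numerically, here is a minimal NumPy sketch (the hidden size, the tanh transition, and the randomly scaled recurrent matrix W_hh are illustrative assumptions, not a trained model) that accumulates the Jacobian product backward through 50 steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                   # hidden size (illustrative)
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)   # scale recurrent weights to spectral norm 0.9

h = rng.normal(size=n)                  # a hidden state to linearize around
J = np.eye(n)                           # accumulates prod_t dh_{t+1}/dh_t

for t in range(50):
    h = np.tanh(W_hh @ h)
    # Jacobian of one step h_{t+1} = tanh(W_hh h_t): diag(1 - h_{t+1}^2) @ W_hh
    J = np.diag(1.0 - h**2) @ W_hh @ J
    if (t + 1) % 10 == 0:
        print(f"after {t + 1:2d} steps: ||Jacobian product|| = {np.linalg.norm(J):.2e}")
```

With every per-step factor below 1 in norm, the printed values shrink toward zero, which is exactly the vanishing behavior described above.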
The Numbers Don't Lie
- Multiplier < 1: gradient essentially vanishes; no learning signal reaches early steps
- Multiplier = 1: perfect preservation; the ideal case that LSTMs achieve
- Multiplier > 1: gradient explodes; training becomes unstable with NaN values
Gradient Flow Through Time
Watch how gradients multiply as they flow backward through time steps. This is why vanilla RNNs struggle with long-range dependencies.
Each step multiplies by 0.5, causing exponential decay
Gradient at t=0 after flowing back from t=10: 1.0 × 0.5^10 ≈ 9.8e-4
The Vanishing Gradient Problem
When the gradient multiplier is less than 1, gradients shrink exponentially. After 10 steps, the gradient is only 9.8e-4 of its original value. This means early layers receive almost no learning signal, making it impossible to learn long-range dependencies.
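The arithmetic behind these regimes is easy to check; here is a quick Python sketch (the multipliers 0.5, 1.0, and 1.5 are just illustrative per-step factors):

```python
# Gradient magnitude after 10 backward steps for a constant per-step multiplier.
for multiplier in (0.5, 1.0, 1.5):
    gradient = 1.0 * multiplier ** 10
    print(f"multiplier {multiplier}: gradient after 10 steps = {gradient:.2e}")
# 0.5 -> 9.77e-04 (vanishes), 1.0 -> 1.00e+00 (preserved), 1.5 -> 5.77e+01 (explodes)
```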
Mathematical Insight
During backpropagation through time (BPTT), the gradient at step t=0 involves a product of Jacobian matrices: ∂h_T/∂h_0 = ∏_{t=0}^{T-1} ∂h_{t+1}/∂h_t. If the eigenvalues of these Jacobians are consistently smaller than 1 in magnitude, the gradient vanishes exponentially; if they are consistently larger than 1, it explodes.
Explain the Vanishing Gradient Problem
Dinner Party Version
The LSTM Solution
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber in 1997, solve the vanishing gradient problem with an elegant architectural change: the cell state.
LSTM Cell Architecture
Each gate's role is explained in the sections below.
The LSTM Solution
Unlike vanilla RNNs where gradients multiply through transformations, the LSTM cell state C_t flows through with only element-wise operations (× and +). This creates a "gradient highway" where gradients can flow unchanged through many time steps, solving the vanishing gradient problem.
The Cell State Equation
The key innovation is the cell state C_t, which flows through time with only element-wise operations:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
| Symbol | Meaning |
|---|---|
| C_t | Cell state at time t; the long-term memory of the LSTM |
| f_t | Forget gate output; values between 0 (forget) and 1 (keep) |
| C_{t-1} | Previous cell state |
| i_t | Input gate output; controls what new information to add |
| C̃_t | Candidate cell state; new information that could be added |
| ⊙ | Element-wise (Hadamard) product |
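A minimal NumPy sketch of just this update (the gate values and vectors below are made-up constants, not outputs of a trained network), to make the point that only element-wise operations touch the cell state:

```python
import numpy as np

C_prev  = np.array([0.8, -0.3, 0.5, 0.0])   # previous cell state C_{t-1}
f_t     = np.array([0.9,  0.1, 1.0, 0.5])   # forget gate output, values in (0, 1)
i_t     = np.array([0.2,  0.8, 0.0, 0.5])   # input gate output, values in (0, 1)
C_tilde = np.array([0.4,  0.9, -0.7, 0.3])  # candidate cell state, values in (-1, 1)

# The cell-state update: only element-wise multiply and add, no matrix transform.
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)   # [0.8, 0.69, 0.5, 0.15]
```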
Why This Solves Vanishing Gradients
Notice the addition (+) in this equation. Unlike vanilla RNNs, where the hidden state is completely transformed at each step, the LSTM cell state is updated additively. When the forget gate is close to 1, the gradient flows through unchanged:

∂C_t/∂C_{t-1} ≈ f_t ≈ 1
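Because the per-step gradient factor on the cell state is essentially the forget gate value, the signal that reaches step 0 is a running product of those values. A quick sketch with assumed constant factors contrasts a vanilla-RNN-like multiplier of 0.5 with a forget gate held near 1:

```python
# Gradient reaching step 0 after T steps is (roughly) the product of per-step factors.
T = 100
for factor, label in [(0.5, "vanilla RNN-like multiplier"), (0.99, "forget gate near 1")]:
    print(f"{label}: gradient after {T} steps = {factor ** T:.2e}")
# 0.5^100 ~ 7.9e-31 (gone), 0.99^100 ~ 3.7e-01 (still a usable learning signal)
```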
The Three Gates
1. Forget Gate
Decides what information to discard from the cell state. A value of 0 means "completely forget", while 1 means "completely keep".
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

| Symbol | Meaning |
|---|---|
| f_t | Forget gate activation (0 to 1) |
| σ | Sigmoid function |
| W_f | Forget gate weight matrix |
| h_{t-1} | Previous hidden state |
| x_t | Current input |
| b_f | Forget gate bias |
2. Input Gate
Decides what new information to store. Works with a candidate value created by a tanh layer.
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

| Symbol | Meaning |
|---|---|
| i_t | Input gate activation (0 to 1) |
| C̃_t | Candidate values (-1 to 1) |
3. Output Gate
Decides what parts of the cell state to output. The output goes through tanh to squash values to [-1, 1].
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

| Symbol | Meaning |
|---|---|
| o_t | Output gate activation (0 to 1) |
| h_t | Hidden state (output) |
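Putting the three gates and the cell-state update together, here is a minimal single-step LSTM cell in NumPy (the layer sizes and randomly initialized, untrained weights are illustrative assumptions; real implementations learn these parameters and typically fuse the four matrix multiplies into one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the gate equations above."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate (0..1)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate (0..1)
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate values (-1..1)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate (0..1)
    C_t = f_t * C_prev + i_t * C_tilde                    # additive cell-state update
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

# Illustrative sizes and randomly initialized (untrained) parameters.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(0, 0.1, (hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):                # a toy 5-step input sequence
    h, C = lstm_step(x, h, C, params)
print("h after 5 steps:", h)
```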
Explain LSTMs to Your Stakeholders
Dinner Party Version
Why Long-Range Dependencies Matter
Let's look at concrete examples where vanilla RNNs fail and LSTMs succeed. In each case, the model must remember information across many tokens.
Long-Range Dependencies in Practice
These examples show why RNNs need long-term memory. The highlighted words must be connected across many intervening tokens.
Simple Agreement (Distance: 1 word)
Complex Agreement (Distance: 12 words)
Why LSTMs Excel Here
Vanilla RNNs struggle with these examples because the gradient signal from the dependent word (e.g., "sits") must travel back through many time steps to reach the source (e.g., "cat"). LSTMs solve this with their cell state, which can carry information across arbitrary distances with minimal degradation.
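In practice you rarely hand-roll the cell; a framework LSTM carries the cell state across the whole sequence for you. Here is a minimal sketch using PyTorch's nn.LSTM (the vocabulary size, dimensions, and random token IDs are placeholders, and the model is untrained, so this only demonstrates the shapes and the carried cell state):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 100, 32, 64       # illustrative sizes
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

# A batch of one 50-token sequence (random IDs stand in for real text).
tokens = torch.randint(0, vocab_size, (1, 50))
outputs, (h_n, c_n) = lstm(embed(tokens))

print(outputs.shape)  # (1, 50, 64): one hidden state per time step
print(c_n.shape)      # (1, 1, 64): final cell state, carried across all 50 steps
```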
Key Takeaways
❌ Vanilla RNN Limitations
- Gradients multiply through time steps
- Exponential decay (vanishing) or growth (exploding)
- Cannot learn dependencies beyond ~10-20 steps
- No explicit memory mechanism
✓ LSTM Advantages
- Additive cell state updates
- Gradient highway through the forget gate
- Can learn dependencies across hundreds of steps
- Explicit forget/remember mechanisms
Test Your Understanding
Take this quiz to reinforce your understanding of vanishing gradients and LSTMs. You need 70% to pass, and you can retake it as many times as you like.
Module 3 Quiz: Vanishing Gradients & LSTMs
Test your understanding of the vanishing gradient problem and how LSTMs solve it.