
Module 7: Attention Mechanisms - The Most Important Innovation


In the previous module, we saw how encoder-decoder architectures enabled transforming one sequence into another. But we also encountered a fundamental limitation: the bottleneck problem. All information about the input had to be compressed into a single fixed-size vector.

Attention mechanisms solve this problem elegantly by allowing the decoder to "look back" at the entire input sequence. This innovation is so important that it became the foundation of modern AI - the Transformer architecture is essentially attention without the RNN.

"I think attention is one of the most interesting recent architectural innovations in neural networks." - Andrej Karpathy

🚧 The Problem

The Bottleneck Problem

Why fixed-size context vectors limit performance

Explain to Your Stakeholders

Dinner Party Version

Imagine playing telephone: someone whispers a long story to you, and you have to remember everything in one thought before passing it on. Naturally, you will forget details. That is exactly what happens when an encoder compresses an entire sentence into a single vector - information gets lost, especially for long sentences.

The Compression Challenge

"The quick brown fox jumps over the lazy dog near the river bank"
14 words of input
โ†’
h โˆˆ โ„โตยนยฒ
Single 512-dim vector
โ†’
Generate translation...
From compressed context

The same vector size must capture a 5-word sentence or a 50-word paragraph
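
To see the constraint concretely, here is a toy NumPy sketch; mean-pooling stands in for a real RNN encoder (a deliberate simplification), and both inputs collapse to the same fixed-size vector:

    import numpy as np

    def encode_to_fixed(token_vectors):
        # Stand-in for an RNN encoder: mean-pool the token vectors.
        # Whatever the input length, the output is one fixed-size vector.
        return token_vectors.mean(axis=0)

    rng = np.random.default_rng(0)
    short_sentence = encode_to_fixed(rng.normal(size=(5, 512)))    # 5 words
    long_paragraph = encode_to_fixed(rng.normal(size=(50, 512)))   # 50 words
    print(short_sentence.shape, long_paragraph.shape)  # (512,) (512,)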

Empirical Evidence

Research showed that BLEU scores (translation quality) degraded significantly for sentences longer than 20-30 words. The decoder simply could not recover all the necessary information from the compressed context vector.

💡 The Solution

Attention: Looking Back at the Source

Let the decoder access all encoder states, not just the final one

Explain to Your Stakeholders

Dinner Party Version

Instead of trying to remember everything at once, imagine you could look back at the original text while translating. For each word you write, you glance at the relevant parts of the source. That is attention - the decoder "attends" to different parts of the input for each output word.

The Three Steps of Attention

1. Score: How relevant is each encoder state?

Attention score: e_{t,i} = a(s_{t-1}, h_i)

Symbol | Color | Meaning
e_{t,i} | red | Attention score (energy): raw score indicating how relevant encoder state i is for decoder step t
s_{t-1} | blue | Decoder hidden state: the decoder state from the previous timestep, representing what we are trying to generate
h_i | green | Encoder hidden state i: the encoder representation of the i-th input position
a | purple | Alignment model: a small neural network (often an MLP) that scores compatibility

For each decoder step t, compute a score for each encoder position i. The alignment model is typically a small neural network.
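
As a concrete illustration, here is a minimal NumPy sketch of an additive (Bahdanau-style) alignment model; the parameters W_s, W_h, and v are hypothetical stand-ins for weights that a real model would learn:

    import numpy as np

    def alignment_score(s_prev, h_i, W_s, W_h, v):
        # Additive alignment: e_{t,i} = v^T tanh(W_s s_{t-1} + W_h h_i),
        # a tiny MLP scoring how well encoder state h_i matches the
        # current decoder state s_prev.
        return v @ np.tanh(W_s @ s_prev + W_h @ h_i)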

2. Normalize: Convert scores to probabilities

Attention weights: α_{t,i} = exp(e_{t,i}) / Σ_k exp(e_{t,k})

Symbol | Color | Meaning
α_{t,i} | orange | Attention weight: normalized probability of attending to encoder position i at decoder step t
e_{t,i} | red | Attention score: raw score before normalization
softmax | purple | Softmax function: normalizes scores to probabilities that sum to 1

Softmax ensures weights sum to 1, creating a probability distribution over encoder positions. High scores become high weights.
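
A minimal softmax sketch; subtracting the maximum score first is a standard numerical-stability trick that does not change the result:

    import numpy as np

    def softmax(scores):
        # Exponentiate and normalize so the weights sum to 1.
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / exp_scores.sum()

    print(softmax(np.array([2.0, 0.5, 0.1])))  # roughly [0.73 0.16 0.11]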

3. Combine: Weighted sum of encoder states

Context vector: c_t = Σ_i α_{t,i} h_i

Symbol | Color | Meaning
c_t | cyan | Context vector: weighted sum of encoder states, a dynamic summary for the current decoder step
α_{t,i} | orange | Attention weight: how much to attend to position i
h_i | green | Encoder hidden state: representation of the input at position i

The context vector is a weighted combination of all encoder states. Unlike the fixed bottleneck, it is different for each decoder step.
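
Putting the three steps together, here is a hedged NumPy sketch of a single decoder step; dot-product scoring replaces the alignment MLP above for brevity:

    import numpy as np

    def attention_step(s_prev, H):
        # H is (n, d): one encoder hidden state per input position.
        scores = H @ s_prev                   # Step 1: score each position
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()              # Step 2: softmax to probabilities
        context = weights @ H                 # Step 3: weighted sum of states
        return context, weights

    rng = np.random.default_rng(0)
    H = rng.normal(size=(4, 8))       # 4 input positions, dimension 8
    s_prev = rng.normal(size=8)       # decoder state from the previous step
    c_t, alpha = attention_step(s_prev, H)
    print(alpha.round(2), alpha.sum())  # a distribution over the 4 positions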

Attention in Action: Translation

Source:  Le    chat   noir   dort
Target:  The   black  cat    sleeps
Attention weights when generating "black": Le 0.05 | chat 0.10 | noir 0.80 | dort 0.05

When generating "black", the model attends strongly to "noir" (0.80) - the French word for black. Different output words attend to different parts of the input.

โš–๏ธComparison

Soft vs Hard Attention

Two approaches to implementing attention

Explain to Your Stakeholders

Dinner Party Version

Hard attention is like pointing a spotlight at one spot - you either look at something or you do not. Soft attention is like a flashlight beam that can spread across multiple things at once, with brighter light on more important parts. Soft attention is easier to train because the "soft" gradients flow smoothly.

Side-by-Side Comparison

Soft Attention (weighted sum)
  • + Differentiable - standard backprop
  • + Easy to train end-to-end
  • + Weighted average of all positions
  • − Less interpretable (soft weights)
  • − Computes over all positions (O(n))

Hard Attention (single selection)
  • + More interpretable (discrete choice)
  • + Only reads one position (efficient)
  • + Clear attention visualization
  • − Non-differentiable (needs REINFORCE)
  • − High variance gradients
Practical Reality

Soft attention dominates in practice because it is easier to train. The slight loss in interpretability is worth the dramatic improvement in optimization stability. When you hear "attention" in modern AI, it almost always means soft attention.
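
To make the contrast concrete, a small sketch: given the same attention weights, soft attention averages over all positions while hard attention commits to a single one:

    import numpy as np

    def soft_read(weights, H):
        # Weighted average over all positions - fully differentiable,
        # so standard backpropagation works end-to-end.
        return weights @ H

    def hard_read(weights, H):
        # Discrete choice of one position - argmax has no gradient,
        # so training requires REINFORCE or similar estimators.
        return H[np.argmax(weights)]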

🧠 Advanced Concepts

Neural Turing Machines and Memory Networks

Taking attention further: differentiable memory

Attention opened the door to a powerful idea: what if neural networks could have external memory that they read from and write to? This led to Neural Turing Machines (NTMs) and Memory Networks - architectures that use attention as a mechanism for memory access.

Neural Turing Machines (2014)

NTMs augment neural networks with an external memory matrix. The network learns to read from and write to memory using attention-based addressing.

  • Content-based addressing (attention over memory)
  • Location-based addressing (shifting focus)
  • Differentiable read/write operations
  • Can learn algorithms (copying, sorting)
Memory Networks (2014)

Memory Networks store facts in memory slots and use attention to retrieve relevant information for question answering.

  • Memory = collection of embeddings
  • Input → attention over memories
  • Multiple "hops" for multi-step reasoning
  • Foundation for later retrieval-augmented models
The Key Insight

Attention is not just for sequence-to-sequence translation - it is a general mechanism for differentiable information retrieval. Given a query, attention computes relevance scores and retrieves a weighted combination of values. This abstraction underlies modern language models' ability to "remember" context.
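
This query/key/value view can be written directly as code; a minimal sketch, assuming each row of K is the key for a memory slot and the matching row of V is its stored value:

    import numpy as np

    def retrieve(query, K, V):
        # Relevance of each memory slot to the query.
        scores = K @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # Differentiable "read": a weighted blend of the stored values.
        return weights @ V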

🌉 The Future

The Bridge to Transformers

"Attention Is All You Need" (2017)

Explain to Your Stakeholders

Dinner Party Version

Attention was so powerful that researchers asked: what if we used ONLY attention, without the RNN? That is the Transformer - attention all the way down. ChatGPT, Google Search, and almost every modern AI system is built on Transformers, which are really just sophisticated attention mechanisms.

The Evolution

2014
RNN + Attention: Sequential processing, attention helps decoder
2015-16
Better Attention: Multi-layer, bidirectional, more sophisticated
2017
Transformer: Remove RNN entirely, use only self-attention
2018+
BERT, GPT, etc.: Scale up Transformers → modern LLMs
Self-Attention: The Key Innovation

In self-attention, a sequence attends to itself. Each position can directly attend to every other position, enabling parallel computation and better gradient flow than recurrence.

Traditional: h_t = f(h_{t-1}, x_t) - must wait for h_{t-1}

Self-Attention: h_i = Attention(x_i, X, X) - all in parallel
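
A minimal single-head self-attention sketch in NumPy, assuming projection matrices W_q, W_k, and W_v that a real model would learn; every position attends to every other in one matrix product, which is what makes the computation parallel:

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        # X is (n, d): the sequence attends to itself.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product
        # Row-wise softmax: each position gets its own distribution
        # over all positions, computed in parallel.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V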

The Takeaway

Understanding attention is understanding the core of modern AI. Every ChatGPT response, every Google search ranking, every code completion - they all use attention at their core. Attention mechanisms are truly the most important innovation in neural network architecture.

Summary

Key Equations
  • Score: e_{t,i} = a(s_{t-1}, h_i)
  • Weights: α_{t,i} = exp(e_{t,i}) / Σ_k exp(e_{t,k})
  • Context: c_t = Σ_i α_{t,i} h_i
Key Concepts
  • Bottleneck: Fixed context vector limits capacity
  • Attention: Dynamic, weighted access to all states
  • Soft vs Hard: Differentiable vs discrete selection
  • Self-attention: Foundation of Transformers

In the next module, we will examine the limitations of RNNs and understand when to use them versus modern Transformer-based architectures.

Test Your Knowledge

Attention Mechanisms - Knowledge Check: test your understanding of attention mechanisms and their role in modern AI. 8 questions, 70% to pass.