
RNN Architecture

Building Memory into Networks

The Core RNN Architecture

Understanding how recurrent neural networks build memory into neural computation

At its heart, an RNN is a neural network with a feedback loop. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that carries information from previous timesteps. This hidden state is the network's “memory” - it's how the network remembers what it has seen before.

The architecture is surprisingly simple: just three weight matrices and a non-linear activation function. Yet this simple structure can learn to model remarkably complex sequential patterns.

Explain RNNs to Your Stakeholders

Dinner Party Version

Imagine you're reading a sentence word by word. As you read each word, you don't forget what came before - you keep a running understanding in your head. An RNN works the same way: it processes sequences one element at a time, maintaining a 'memory' (hidden state) that gets updated with each new input. It's like a person with a notepad who reads a book one word at a time, constantly scribbling notes about what they've seen so far.
1. Core Equation

The Hidden State Update

This is the heart of the RNN - how memory is updated at each timestep:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

Symbols:

$h_t$ (hidden state at time t): the "memory" of the network at the current timestep.
$h_{t-1}$ (previous hidden state): memory from the previous timestep.
$x_t$ (input at time t): the current input vector (e.g., a character embedding).
$W_{hh}$ (hidden-to-hidden weight matrix): learned parameters that transform the previous hidden state.
$W_{xh}$ (input-to-hidden weight matrix): learned parameters that transform the input.
$b_h$ (hidden bias term): learned offset added to the computation.

Understanding the Equation

The hidden state equation combines two sources of information: what the network remembers (via $W_{hh} h_{t-1}$) and what it's currently seeing (via $W_{xh} x_t$). The $\tanh$ activation squashes everything to the range [-1, 1], preventing values from exploding.
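In code, the update is a single line of matrix arithmetic. Here is a minimal sketch in NumPy; the function name, dimensions, and random weights are illustrative assumptions, not part of the original:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One RNN timestep: combine previous memory with the current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

# Illustrative dimensions (assumed): 128-dim hidden state, 10-dim input.
hidden_size, input_size = 128, 10
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # input-to-hidden
b_h = np.zeros(hidden_size)                            # hidden bias

h = np.zeros(hidden_size)         # no prior state at t=0
x = rng.normal(size=input_size)   # stand-in input vector
h = rnn_step(h, x, W_hh, W_xh, b_h)
print(h.shape)  # (128,) - squashed into (-1, 1) by tanh
```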

Concrete Examples

Example 1: Character-Level Language Model

Processing the word “hello” character by character:

t=0: $x_0$ = ‘h’, $h_0 = \tanh(W_{xh} x_0 + b_h)$ (no prior state: $h_{-1} = 0$)
t=1: $x_1$ = ‘e’, $h_1 = \tanh(W_{hh} h_0 + W_{xh} x_1 + b_h)$
t=2: $x_2$ = ‘l’, $h_2 = \tanh(W_{hh} h_1 + W_{xh} x_2 + b_h)$
...and so on. Each $h_t$ encodes context from all previous characters.
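A hedged sketch of this loop over “hello”, assuming one-hot character inputs and small random weights (both illustrative choices):

```python
import numpy as np

vocab = sorted(set("hello"))                 # ['e', 'h', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

hidden_size, vocab_size = 8, len(vocab)
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # h_{-1} = 0: no prior state
for t, ch in enumerate("hello"):
    x = np.zeros(vocab_size)
    x[char_to_ix[ch]] = 1.0                  # one-hot input x_t
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # the hidden state update
    print(f"t={t}: input '{ch}', h now encodes '{'hello'[:t+1]}'")
```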
Example 2: Numerical Computation

A simplified example with a 2D hidden state and a 1D input, worked through in the sketch below with illustrative numbers.
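The numbers here are assumptions chosen for illustration, since the original's specific values are not shown:

```python
import numpy as np

# Illustrative values (assumed): 2D hidden state, 1D (scalar) input.
h_prev = np.array([0.5, -0.3])      # h_{t-1}
x_t = np.array([1.0])               # x_t
W_hh = np.array([[0.6, -0.2],
                 [0.1,  0.4]])      # hidden-to-hidden weights
W_xh = np.array([[ 0.8],
                 [-0.5]])           # input-to-hidden weights
b_h = np.array([0.0, 0.1])          # hidden bias

pre = W_hh @ h_prev + W_xh @ x_t + b_h
h_t = np.tanh(pre)
print(pre)   # [ 1.16 -0.47]          pre-activation
print(h_t)   # [ 0.820 -0.438] approx, squashed into (-1, 1)
```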
Example 3: Sentiment Tracking

How hidden state might track sentiment in a review:

“The movie was” → neutral, awaiting judgment
“The movie was terrible” → shifts strongly negative
“The movie was terrible... ly good!” → reverses to positive
The hidden state continuously updates, allowing the network to “change its mind” as new information arrives.
2. Output Layer

The Output Computation

Transforming hidden state into predictions:

$$y_t = W_{hy} h_t + b_y$$

Symbols:

$y_t$ (raw output at time t): unnormalized scores (logits) for each possible output.
$W_{hy}$ (hidden-to-output weight matrix): learned parameters that transform hidden state to output.
$h_t$ (current hidden state): the memory representation at this timestep.
$b_y$ (output bias term): learned offset for the output layer.

Understanding the Equation

This is a simple linear transformation that projects the hidden state into the output space. If we're predicting the next character from a vocabulary of 65 characters, $W_{hy}$ would be a 65 × 128 matrix (for a 128-dimensional hidden state), producing 65 raw scores (logits) - one for each possible character.
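A minimal sketch of this projection in NumPy, using the 65-character vocabulary and 128-dim hidden state from the example below; the weights are random placeholders:

```python
import numpy as np

hidden_size, vocab_size = 128, 65
rng = np.random.default_rng(0)
W_hy = rng.normal(0, 0.1, (vocab_size, hidden_size))  # hidden-to-output
b_y = np.zeros(vocab_size)                            # output bias

h_t = np.tanh(rng.normal(size=hidden_size))  # stand-in hidden state
y_t = W_hy @ h_t + b_y                       # logits: one score per character
print(y_t.shape)  # (65,)
```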

Concrete Examples

Example 1: Character Prediction

Projecting a 128-dim hidden state to a 65-character vocabulary:

$h_t \in \mathbb{R}^{128}$ (128-dimensional hidden state)
$W_{hy} \in \mathbb{R}^{65 \times 128}$ (weight matrix)
$y_t \in \mathbb{R}^{65}$ (one score per character)
After “hel”, $y_t$ might have high scores for ‘l’ and ‘p’, low scores for ‘z’ and ‘q’.
Example 2: Numerical Computation

Simplified example with a 2D hidden state and 3 output classes, worked through in the sketch below:

Class 0 (score 1.05) is most likely before softmax.
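The specific matrices here are assumptions, chosen so that class 0 comes out at 1.05 to match the result quoted above:

```python
import numpy as np

# Illustrative values (assumed) arranged so class 0 scores 1.05.
h_t = np.array([0.5, -0.3])        # 2D hidden state
W_hy = np.array([[ 1.5, -1.0],
                 [ 0.2,  0.8],
                 [-0.5,  0.4]])    # 3 classes x 2 hidden dims
b_y = np.zeros(3)

y_t = W_hy @ h_t + b_y
print(y_t)  # [ 1.05 -0.14 -0.37] -> class 0 has the highest logit
```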
Example 3: Many-to-Many Architecture

Output at every timestep (e.g., POS tagging):

Input: [“The”, “cat”, “sat”]
t=0: $y_0$ → [“DET”, “NOUN”, “VERB”, ...] scores → predict “DET”
t=1: $y_1$ → predict “NOUN”
t=2: $y_2$ → predict “VERB”
Each output uses context from all previous words via the hidden state.
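A shape-level sketch of this many-to-many pattern: an output is produced at every timestep. The weights are untrained and the embeddings random (illustrative assumptions), so the printed tags are meaningless; only the flow of computation is the point:

```python
import numpy as np

tags = ["DET", "NOUN", "VERB"]
words = ["The", "cat", "sat"]

hidden_size, input_size = 16, 8
rng = np.random.default_rng(0)
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))
W_hy = rng.normal(0, 0.1, (len(tags), hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(len(tags))

h = np.zeros(hidden_size)
for t, word in enumerate(words):
    x = rng.normal(size=input_size)            # stand-in word embedding
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)     # update memory with this word
    y = W_hy @ h + b_y                         # one logit per tag, every timestep
    print(f"t={t}: {word!r} -> {tags[int(np.argmax(y))]}")
```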
3. Probability Layer

The Softmax Function

Converting raw scores into a probability distribution:

$$p_{t,i} = \frac{e^{y_{t,i}}}{\sum_j e^{y_{t,j}}}$$

Symbols:

$p_t$ (probability distribution at time t): probabilities for each possible next character/token.
$y_t$ (raw output scores, i.e. logits): unnormalized predictions from the network.
$e^{y_{t,i}}$ (exponentiated score for class i): converts logits to positive values.

Understanding the Equation

Softmax transforms the raw output scores (logits) into probabilities that sum to 1. The exponentiation ensures all values are positive, and the normalization by the sum ensures they form a valid probability distribution. Higher logits get higher probabilities, with the differences amplified by the exponential.
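A minimal NumPy implementation; subtracting the maximum logit before exponentiating is a standard numerical-stability trick, not something the equation itself requires. The temperature parameter is used in Example 2 below:

```python
import numpy as np

def softmax(y, temperature=1.0):
    """Convert logits to probabilities; subtract the max for numerical stability."""
    z = (y - np.max(y)) / temperature
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099]
```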

Concrete Examples

Example 1: Basic Softmax Computation

Converting logits [2.0, 1.0, 0.1] to probabilities:

$e^{2.0} = 7.39$, $e^{1.0} = 2.72$, $e^{0.1} = 1.11$
Sum: $7.39 + 2.72 + 1.11 = 11.21$
Probabilities: $[7.39/11.21,\; 2.72/11.21,\; 1.11/11.21] \approx [0.66, 0.24, 0.10]$
The highest logit (2.0) gets 66% of the probability mass.
Example 2: Temperature Scaling

Temperature controls the “sharpness” of the distribution:

Logits: [2.0, 1.0, 0.1]
T=1.0: [0.66, 0.24, 0.10] (standard)
T=0.5: [0.86, 0.12, 0.02] (sharper, more confident)
T=2.0: [0.50, 0.30, 0.19] (softer, more uniform)
Lower temperature → more deterministic. Higher → more random/creative.
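The softmax sketch from above, with its temperature parameter, reproduces these three distributions:

```python
import numpy as np

def softmax(y, temperature=1.0):
    z = (y - np.max(y)) / temperature
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in (1.0, 0.5, 2.0):
    print(T, np.round(softmax(logits, temperature=T), 2))
# 1.0 [0.66 0.24 0.1 ]
# 0.5 [0.86 0.12 0.02]
# 2.0 [0.5  0.3  0.19]
```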
Example 3: Sampling from the Distribution

Using probabilities to generate the next character:

After "Shakespear", probabilities for next char:
(most likely)
(plausible)
(combined)
We sample from this distribution rather than always picking the max (argmax), which adds variety to generation.
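A sketch of sampling, with a hypothetical next-character distribution (the characters and probabilities are illustrative, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical next-char distribution after "Shakespear" (illustrative numbers).
chars = np.array(['e', 's', 'i', 'a'])
probs = np.array([0.85, 0.07, 0.05, 0.03])

# Sampling instead of argmax keeps generation varied:
samples = rng.choice(chars, size=10, p=probs)
print("".join(samples))  # mostly 'e', occasionally something else
```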
💡 Key Insight

"Training RNNs is Optimization Over Programs"

Andrej Karpathy's profound observation about what RNNs really are

"If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs."

This insight is crucial: RNNs don't just learn static input-output mappings. Because they process sequences step-by-step with internal state, they're actually learning algorithms - procedures that maintain and update memory as they process data.

Feedforward Networks

Learn functions: $y = f(x)$

Fixed computation, no memory, process each input independently.

Recurrent Networks

Learn programs with state: loops, conditionals, memory

Dynamic computation that adapts based on what's been seen.

This is why RNNs are Turing complete - given enough units and appropriate weights, they can theoretically compute anything a computer can compute. In practice, this manifests as RNNs learning to count, track parentheses, maintain context across long sequences, and implement complex conditional logic - all emergent behaviors from the simple recurrent update equation.

Putting It All Together

The complete RNN forward pass at each timestep:

1. Update hidden state: $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
2. Compute output logits: $y_t = W_{hy} h_t + b_y$
3. Convert to probabilities: $p_t = \mathrm{softmax}(y_t)$

The same three weight matrices ($W_{hh}$, $W_{xh}$, $W_{hy}$) are used at every timestep - this is the beauty of parameter sharing. In the next module, we'll explore what happens when we try to train these weights through backpropagation through time, and why long sequences cause the infamous vanishing gradient problem.
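As a closing sketch, here are all three steps in one function, with the weights created once and reused at every timestep (all names and dimensions are illustrative assumptions):

```python
import numpy as np

def rnn_forward_step(x_t, h_prev, params, temperature=1.0):
    """One full timestep: hidden-state update, output projection, softmax."""
    W_hh, W_xh, W_hy, b_h, b_y = params
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # 1. update memory
    y_t = W_hy @ h_t + b_y                           # 2. compute logits
    z = (y_t - np.max(y_t)) / temperature
    p_t = np.exp(z) / np.exp(z).sum()                # 3. probabilities
    return h_t, p_t

# Illustrative setup: 65-character vocabulary, 128-dim hidden state.
vocab_size, hidden_size = 65, 128
rng = np.random.default_rng(0)
params = (
    rng.normal(0, 0.1, (hidden_size, hidden_size)),  # W_hh, shared across t
    rng.normal(0, 0.1, (hidden_size, vocab_size)),   # W_xh, shared across t
    rng.normal(0, 0.1, (vocab_size, hidden_size)),   # W_hy, shared across t
    np.zeros(hidden_size),                           # b_h
    np.zeros(vocab_size),                            # b_y
)

h = np.zeros(hidden_size)
for t in range(3):                      # same weights reused every timestep
    x = np.zeros(vocab_size)
    x[rng.integers(vocab_size)] = 1.0   # stand-in one-hot input
    h, p = rnn_forward_step(x, h, params)
print(p.sum())  # 1.0 -> a valid probability distribution
```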