RNN Architecture
Building Memory into Networks
The Core RNN Architecture
Understanding how recurrent neural networks build memory into neural computation
At its heart, an RNN is a neural network with a feedback loop. Unlike feedforward networks that process each input independently, RNNs maintain a hidden state that carries information from previous timesteps. This hidden state is the network's “memory” - it's how the network remembers what it has seen before.
The architecture is surprisingly simple: just three weight matrices and a non-linear activation function. Yet this simple structure can learn to model remarkably complex sequential patterns.
The Hidden State Update
This is the heart of the RNN - how memory is updated at each timestep
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

| Symbol | Color | Meaning |
|---|---|---|
| $h_t$ | blue | Hidden state at time t - the "memory" of the network at the current timestep |
| $h_{t-1}$ | blue | Previous hidden state - memory from the previous timestep |
| $x_t$ | red | Input at time t - the current input vector (e.g., a character embedding) |
| $W_{hh}$ | green | Hidden-to-hidden weight matrix - learned parameters that transform the previous hidden state |
| $W_{xh}$ | orange | Input-to-hidden weight matrix - learned parameters that transform the input |
| $b_h$ | purple | Hidden bias term - learned offset added to the computation |
Understanding the Equation
The hidden state equation combines two sources of information: what the network remembers (via $W_{hh} h_{t-1}$) and what it's currently seeing (via $W_{xh} x_t$). The tanh activation squashes everything to the range [-1, 1], preventing values from exploding.
Concrete Examples
Example 1: Character-Level Language Model
Processing the word “hello” character by character:
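A minimal sketch of this loop in NumPy (the weights here are random and untrained - an illustrative assumption - so the hidden states are meaningless numbers; the point is purely the mechanics of the update):

```python
import numpy as np

# Illustrative sketch: feed "hello" one character at a time and watch
# the hidden state evolve. Weights are random, NOT trained.
rng = np.random.default_rng(0)
vocab = sorted(set("hello"))                 # ['e', 'h', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

hidden_size, vocab_size = 8, len(vocab)
W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))   # input-to-hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # h_0: empty memory
for ch in "hello":
    x = np.zeros(vocab_size)
    x[char_to_ix[ch]] = 1.0                  # one-hot input vector
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)   # the hidden state update
    print(ch, h[:3].round(3))                # peek at the first 3 units
```

Note that the two 'l' steps produce different hidden states, because $h_{t-1}$ differs each time - that is the memory at work.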
Example 2: Numerical Computation
A simplified example with 2D hidden state and 1D input:
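A worked version with hand-picked numbers (these specific values are invented for illustration, not taken from a trained model):

```python
import numpy as np

W_hh = np.array([[0.5, -0.3],
                 [0.2,  0.4]])    # hidden-to-hidden (2x2)
W_xh = np.array([[1.0],
                 [-0.5]])         # input-to-hidden (2x1)
b_h  = np.array([0.1, 0.0])

h_prev = np.array([0.2, -0.1])    # h_{t-1}
x_t    = np.array([0.7])          # the 1-D input

pre = W_hh @ h_prev + W_xh @ x_t + b_h
h_t = np.tanh(pre)
print(pre)   # [ 0.93 -0.35]  pre-activation
print(h_t)   # [ 0.731... -0.336...]  squashed into (-1, 1)
```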
Example 3: Sentiment Tracking
How hidden state might track sentiment in a review:
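A hand-crafted toy, not a learned model: imagine a single hidden unit whose job is to accumulate sentiment, keeping most of its old memory and nudging it by each word. The per-word scores below are invented for illustration:

```python
import numpy as np

# Invented per-word sentiment scores (a trained RNN would learn
# something like this implicitly in its weights).
word_scores = {"great": 0.8, "plot": 0.0, "but": -0.1,
               "terrible": -0.9, "acting": 0.0}

h = 0.0                              # 1-D "sentiment" hidden state
for word in "great plot but terrible acting".split():
    # h_t = tanh(0.9 * h_{t-1} + score): keep most of the old memory,
    # then shift it by the current word's sentiment.
    h = np.tanh(0.9 * h + word_scores[word])
    print(f"{word:>8s}  h = {h:+.3f}")
# The state starts positive after "great" and flips negative
# once "terrible" arrives.
```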
The Output Computation
Transforming hidden state into predictions
$$y_t = W_{hy} h_t + b_y$$

| Symbol | Color | Meaning |
|---|---|---|
| $y_t$ | magenta | Raw output at time t - unnormalized scores (logits) for each possible output |
| $W_{hy}$ | cyan | Hidden-to-output weight matrix - learned parameters that transform hidden state to output |
| $h_t$ | blue | Current hidden state - the memory representation at this timestep |
| $b_y$ | purple | Output bias term - learned offset for the output layer |
Understanding the Equation
This is a simple linear transformation that projects the hidden state into the output space. If we're predicting the next character from a vocabulary of 65 characters and the hidden state has 128 dimensions, $W_{hy}$ would be a 65 × 128 matrix, producing 65 raw scores (logits) - one for each possible character.
Concrete Examples
Example 1: Character Prediction
Projecting a 128-dim hidden state to a 65-character vocabulary:
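A minimal sketch of the projection (sizes from the example above; the weights and hidden state are random stand-ins, purely to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 128, 65

W_hy = rng.normal(0, 0.01, (vocab_size, hidden_size))  # 65 x 128
b_y  = np.zeros(vocab_size)

h_t = np.tanh(rng.normal(size=hidden_size))  # a stand-in hidden state
y_t = W_hy @ h_t + b_y                       # 65 raw scores (logits)
print(y_t.shape)                             # (65,)
print(int(y_t.argmax()))                     # index of the top-scoring char
```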
Example 2: Numerical Computation
Simplified example with 2D hidden state and 3 output classes:
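Worked through with hand-picked numbers (invented for illustration):

```python
import numpy as np

W_hy = np.array([[ 1.0, -1.0],
                 [ 0.5,  0.5],
                 [-0.5,  1.0]])      # 3x2: hidden-to-output
b_y  = np.array([0.0, 0.1, -0.1])

h_t = np.array([0.73, -0.34])        # current 2-D hidden state
y_t = W_hy @ h_t + b_y
print(y_t)   # [ 1.07   0.295 -0.805] -> class 0 scores highest
```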
Example 3: Many-to-Many Architecture
Output at every timestep (e.g., POS tagging):
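A sketch of the many-to-many pattern (the vocabulary size, tag count, and token sequence are all illustrative assumptions): the same recurrent step runs at every position, and we read off logits each time instead of only at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size, num_tags = 16, 10, 4   # e.g. 4 POS tags

W_xh = rng.normal(0, 0.1, (hidden_size, vocab_size))
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))
W_hy = rng.normal(0, 0.1, (num_tags, hidden_size))
b_h, b_y = np.zeros(hidden_size), np.zeros(num_tags)

tokens = [3, 1, 4, 1, 5]            # a "sentence" as word indices
h = np.zeros(hidden_size)
for t, tok in enumerate(tokens):
    x = np.zeros(vocab_size)
    x[tok] = 1.0                    # one-hot word vector
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)
    y = W_hy @ h + b_y              # one tag prediction per timestep
    print(f"t={t}: predicted tag {int(y.argmax())}")
```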
The Softmax Function
Converting raw scores into a probability distribution
$$p_t = \mathrm{softmax}(y_t), \qquad p_{t,i} = \frac{e^{y_{t,i}}}{\sum_j e^{y_{t,j}}}$$

| Symbol | Color | Meaning |
|---|---|---|
| $p_t$ | green | Probability distribution at time t - probabilities for each possible next character/token |
| $y_t$ | magenta | Raw output scores (logits) - unnormalized predictions from the network |
| $e^{y_{t,i}}$ | gray | Exponentiated score for class i - converts logits to positive values |
Understanding the Equation
Softmax transforms the raw output scores (logits) into probabilities that sum to 1. The exponentiation ensures all values are positive, and the normalization by the sum ensures they form a valid probability distribution. Higher logits get higher probabilities, with the differences amplified by the exponential.
Concrete Examples
Example 1: Basic Softmax Computation
Converting logits [2.0, 1.0, 0.1] to probabilities:
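Working through those numbers in code (subtracting the max before exponentiating is a standard numerical-stability detail; it doesn't change the result):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exps = np.exp(logits - logits.max())   # subtract max for stability
probs = exps / exps.sum()
print(probs.round(3))   # [0.659 0.242 0.099] -- sums to 1
```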
Example 2: Temperature Scaling
Temperature controls the “sharpness” of the distribution:
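A quick sketch: divide the logits by a temperature T before the softmax. The logits here are the same illustrative [2.0, 1.0, 0.1]:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for T in [0.5, 1.0, 2.0]:
    print(T, softmax(logits / T).round(3))
# T=0.5 sharpens:        roughly [0.864 0.117 0.019]
# T=1.0 is plain softmax:        [0.659 0.242 0.099]
# T=2.0 flattens:        roughly [0.502 0.304 0.194]
```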
Example 3: Sampling from the Distribution
Using probabilities to generate the next character:
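A minimal sketch (the three-character vocabulary is an invented stand-in): rather than always taking the argmax, draw the next character in proportion to its probability, which is what gives generated text its variety.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "b", "c"]                   # hypothetical tiny vocabulary
probs = np.array([0.659, 0.242, 0.099])   # from the softmax example

next_char = rng.choice(vocab, p=probs)    # sample, don't argmax
print(next_char)
```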
"Training RNNs is Optimization Over Programs"
Andrej Karpathy's profound observation about what RNNs really are
"If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs."
This insight is crucial: RNNs don't just learn static input-output mappings. Because they process sequences step-by-step with internal state, they're actually learning algorithms - procedures that maintain and update memory as they process data.
Feedforward Networks
Learn functions: $y = f(x)$
Fixed computation, no memory; each input is processed independently.
Recurrent Networks
Learn programs with state: loops, conditionals, memory
Dynamic computation that adapts based on what's been seen.
This is why RNNs are Turing complete - given enough units and appropriate weights, they can theoretically compute anything a computer can compute. In practice, this manifests as RNNs learning to count, track parentheses, maintain context across long sequences, and implement complex conditional logic - all emergent behaviors from the simple recurrent update equation.
Putting It All Together
The complete RNN forward pass at each timestep:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
$$p_t = \mathrm{softmax}(y_t)$$
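Here is a minimal end-to-end sketch of those three steps in NumPy (the sizes and random weights are illustrative assumptions; a real model would use trained weights):

```python
import numpy as np

def rnn_step(x, h_prev, params):
    """One full RNN timestep: update memory, project, normalize."""
    W_xh, W_hh, W_hy, b_h, b_y = params
    h = np.tanh(W_hh @ h_prev + W_xh @ x + b_h)   # hidden state update
    y = W_hy @ h + b_y                            # logits
    e = np.exp(y - y.max())
    return h, y, e / e.sum()                      # softmax probabilities

rng = np.random.default_rng(0)
V, H = 65, 128                           # vocab size, hidden size
params = (rng.normal(0, 0.01, (H, V)),   # W_xh
          rng.normal(0, 0.01, (H, H)),   # W_hh
          rng.normal(0, 0.01, (V, H)),   # W_hy
          np.zeros(H), np.zeros(V))      # b_h, b_y

h = np.zeros(H)
for t in range(5):                  # the SAME weights at every step
    x = np.zeros(V)
    x[rng.integers(V)] = 1.0        # a random one-hot input
    h, y, p = rnn_step(x, h, params)
    print(f"t={t}: top index {int(p.argmax())}, p={p.max():.4f}")
```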
The same three weight matrices ($W_{hh}$, $W_{xh}$, $W_{hy}$) are used at every timestep - this is the beauty of parameter sharing. In the next module, we'll explore what happens when we try to train these weights with backpropagation through time, and why long sequences cause the infamous vanishing gradient problem.