Module 4: Character-Level Language Modeling
The core idea behind modern AI: predict the next token
This module covers the heart of Karpathy's blog post: training an RNN to predict the next character in a sequence. This deceptively simple task - when scaled up - is the foundation of ChatGPT, Claude, and every modern language model.
We'll walk through the complete pipeline: encoding characters as vectors, training with cross-entropy loss, and sampling with temperature control. By the end, you'll understand exactly what happens when you adjust the "temperature" slider in your favorite AI tool.
Next-Character Prediction
Learn language by predicting what comes next
The "hello" Example
Consider training on the single word "hello". At each position, the model must predict the next character: given "h" predict "e", given "he" predict "l", given "hel" predict "l", and given "hell" predict "o".
The model learns: after "h", "e" is likely. After "hel", "l" is likely. After "hell", "o" is likely. Scale this to millions of words, and the model learns language.
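The same training pairs can be built programmatically. A minimal Python sketch (the variable names are ours, not from the original post):

```python
text = "hello"

# Build (context, next-character) training pairs: at each position the model
# sees the characters so far and must predict the one that follows.
pairs = [(text[:i + 1], text[i + 1]) for i in range(len(text) - 1)]
print(pairs)  # [('h', 'e'), ('he', 'l'), ('hel', 'l'), ('hell', 'o')]
```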
Step 1: One-Hot Encoding
Converting characters to vectors the network can process
Neural networks work with numbers, not characters. We convert each character to a one-hot vector: a vector of zeros with a single 1 at the character's index.
Example: Vocabulary = {h, e, l, o}, giving h = [1, 0, 0, 0], e = [0, 1, 0, 0], l = [0, 0, 1, 0], o = [0, 0, 0, 1].
Why One-Hot?
One-hot encoding treats each character as equally different from every other character. 'a' isn't "closer" to 'b' than to 'z' - they're all orthogonal vectors. The network learns meaningful relationships during training.
For a vocabulary of V characters, each input is a V-dimensional vector. Real character-level models might have V ≈ 65-100 (letters, digits, punctuation). Byte-level models use V = 256.
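A minimal sketch of one-hot encoding for the toy {h, e, l, o} vocabulary, assuming NumPy; the helper names (`char_to_ix`, `one_hot`) are illustrative:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}   # character -> index

def one_hot(ch):
    """Return a len(vocab)-dimensional vector with a single 1 at the character's index."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

print(one_hot('e'))  # [0. 1. 0. 0.]
```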
Step 2: Cross-Entropy Loss
Measuring how wrong the model's predictions are
At each timestep, the model outputs a probability distribution over all possible next characters. We measure the quality of these predictions using cross-entropy loss:

$$L = -\sum_{t} \log p(x_{t+1} \mid x_1, \dots, x_t)$$

| Symbol | Color | Meaning |
|---|---|---|
| $L$ | red | Total loss (cross-entropy): what we minimize during training; lower is better |
| $p(x_{t+1} \mid x_1, \dots, x_t)$ | blue | Predicted probability of the next character: the model's confidence in its prediction |
| $x_t$ | green | Character at position t: the actual character in the training sequence |
| $\log$ | purple | Natural logarithm: converts probabilities to log-probabilities for numerical stability |
We sum the negative log-probability of the correct next character at each position. Lower loss = better predictions.
Concrete Example
After seeing "hel", suppose the model assigns the correct next character 'l' a probability of p('l') = 0.6 (spreading the remaining 0.4 over 'h', 'e', and 'o').
Loss for this timestep: -ln(0.6) ≈ 0.51.
If the model had instead predicted p('l') = 0.99, the loss would be -ln(0.99) ≈ 0.01 (much better!).
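The same arithmetic in code, using the illustrative probabilities above:

```python
import numpy as np

# Illustrative distribution after seeing "hel"; the correct next character is 'l'.
probs = {'h': 0.05, 'e': 0.15, 'l': 0.60, 'o': 0.20}

loss = -np.log(probs['l'])
print(f"loss = {loss:.2f}")                      # 0.51
print(f"confident model: {-np.log(0.99):.2f}")   # 0.01
```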
Perplexity: A Human-Readable Metric
Perplexity = exp(L / T), the exponentiated average loss over T characters. It represents "how many characters the model is confused between on average." A perplexity of 5 means the model is as uncertain as if randomly choosing between 5 characters. Good character models achieve perplexity around 1.5-2.
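A quick sketch of the perplexity calculation, using made-up per-character losses:

```python
import numpy as np

# Hypothetical per-character losses (-log p of each correct character).
losses = np.array([0.51, 0.22, 0.70, 0.35])
perplexity = np.exp(losses.mean())
print(f"perplexity = {perplexity:.2f}")  # ~1.56: as uncertain as choosing among ~1.6 chars
```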
Temperature-Scaled Sampling
Controlling creativity vs. consistency in generation
The Temperature Equation
$$p_i = \frac{\exp(z_i / \tau)}{\sum_{j} \exp(z_j / \tau)}$$

| Symbol | Color | Meaning |
|---|---|---|
| $p_i$ | red | Probability of character i: output probability after temperature scaling |
| $z_i$ | blue | Logit (raw score) for character i: unnormalized output from the network |
| $\tau$ | orange | Temperature parameter: controls randomness (low = deterministic, high = random) |
| $\exp(z_i / \tau)$ | green | Exponentiated scaled logit: softmax numerator with temperature scaling |
Dividing logits by temperature τ before softmax controls the "sharpness" of the distribution.
Temperature in Action
Given logits [2.0, 1.0, 0.5, 0.1] for characters ['e', 'a', 'i', 'o'] (the resulting probabilities are worked out in the sketch below):
- τ = 0.5 (Focused): the distribution sharpens; the model almost always picks 'e'
- τ = 1.0 (Normal): the trained distribution, unchanged
- τ = 2.0 (Creative): the distribution flattens; the model is more willing to try alternatives
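Here is a small NumPy sketch that applies temperature-scaled softmax to those exact logits; the function name is ours:

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Divide logits by tau, then apply a numerically stable softmax."""
    z = np.asarray(logits) / tau
    z -= z.max()                      # stability: shift so the largest scaled logit is 0
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]         # scores for ['e', 'a', 'i', 'o']
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(softmax_with_temperature(logits, tau), 3))
# 0.5 -> [0.828 0.112 0.041 0.019]   (sharply focused on 'e')
# 1.0 -> [0.575 0.211 0.128 0.086]   (the trained distribution)
# 2.0 -> [0.406 0.246 0.192 0.157]   (flatter: alternatives get real probability)
```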
Connection to Modern AI
When you set temperature in ChatGPT, Claude, or other AI assistants, you're adjusting this exact parameter. The underlying principle hasn't changed since Karpathy's character-level RNNs - it's just scaled to trillions of parameters and tokens instead of characters.
The Training Process
How the model learns from data
1. Forward pass: feed the characters in one at a time, updating the hidden state at each step
2. Predict: at each timestep, output a probability distribution over the vocabulary
3. Compute loss: cross-entropy between the predictions and the actual next characters
4. Backpropagate: compute gradients and update the weights (W_hh, W_xh, W_hy)
5. Repeat: process different sequences, gradually improving predictions
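A minimal PyTorch sketch of this loop; it is not Karpathy's original min-char-rnn code, and the toy corpus, hyperparameters, and variable names are only illustrative:

```python
import torch
import torch.nn as nn

text = "hello world, hello world"              # toy corpus for illustration
vocab = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
V, hidden_size, seq_len = len(vocab), 64, 8

rnn = nn.RNN(input_size=V, hidden_size=hidden_size, batch_first=True)
head = nn.Linear(hidden_size, V)                # hidden state -> logits over vocabulary
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-2)

def one_hot(indices):
    return torch.eye(V)[indices]                # (seq_len, V) one-hot rows

for step in range(200):
    start = torch.randint(0, len(text) - seq_len - 1, (1,)).item()
    chunk = text[start:start + seq_len + 1]
    ixs = torch.tensor([char_to_ix[c] for c in chunk])
    x = one_hot(ixs[:-1]).unsqueeze(0)          # inputs:  characters 0..seq_len-1
    y = ixs[1:]                                 # targets: characters shifted by one

    out, _ = rnn(x)                             # forward pass over the whole chunk
    logits = head(out.squeeze(0))               # (seq_len, V) scores for next characters
    loss = loss_fn(logits, y)                   # cross-entropy vs. actual next characters

    opt.zero_grad()
    loss.backward()                             # backpropagation through time
    opt.step()                                  # update all weights (W_xh, W_hh, W_hy analogues)
```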
What the Model Learns
Early Training
- • Common letter frequencies
- • Basic letter combinations ("th", "qu")
- • Word boundaries (spaces)
Later Training
- • Complete words and spelling
- • Grammar and sentence structure
- • Style and long-range patterns
Generating Text
Sampling from the trained model
Once trained, we can generate text by sampling from the model's predictions:
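A minimal autoregressive sampling sketch with temperature, reusing the `rnn`, `head`, `V`, and `char_to_ix` names from the training sketch above (again illustrative, not the original code):

```python
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

@torch.no_grad()
def generate(seed_char, length=50, tau=1.0):
    h = None
    ch = seed_char
    out_chars = [ch]
    for _ in range(length):
        x = torch.eye(V)[[char_to_ix[ch]]].unsqueeze(0)       # (1, 1, V) one-hot input
        out, h = rnn(x, h)                                    # carry the hidden state forward
        logits = head(out[0, -1])                             # scores for the next character
        probs = torch.softmax(logits / tau, dim=-1)           # temperature-scaled softmax
        ch = ix_to_char[torch.multinomial(probs, 1).item()]   # sample, then feed back in
        out_chars.append(ch)
    return "".join(out_chars)

print(generate('h', tau=0.5))   # focused
print(generate('h', tau=2.0))   # more random
```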
Sample Outputs by Temperature
Same Shakespeare-trained model, different temperatures:
"The king is the more the state of the state of the people..."
Coherent but repetitive
"The king doth wake to-night and takes his rouse, keeps wassail..."
Good variety and coherence
"The kinghBrol'd-Loss?ump thee, veck'd shalg remond..."
Creative but less coherent
The Unreasonable Power of Next-Token Prediction
The remarkable thing about character-level modeling is how much emerges from such a simple objective. The model isn't explicitly taught spelling, grammar, or style - it learns all of these as a byproduct of predicting the next character well.
What We Teach
"Given these characters, predict the next one"
What It Learns
- • Spelling and vocabulary
- • Grammar and syntax
- • Style and voice
- • Long-range coherence
This same principle - scaled from characters to tokens, from megabytes to terabytes, from single GPUs to massive clusters - is what powers today's large language models. The foundation is exactly what you've learned in this module.
Summary
Key Equations
- Loss: $L = -\sum_{t} \log p(x_{t+1} \mid x_1, \dots, x_t)$
- Temperature: $p_i = \exp(z_i / \tau) \big/ \sum_{j} \exp(z_j / \tau)$
- Perplexity: $e^{L/T}$ for a sequence of $T$ characters
Key Concepts
- • One-hot encoding for characters
- • Cross-entropy measures prediction quality
- • Temperature controls creativity vs. consistency
- • Autoregressive generation (feed output back as input)
In the next module, we'll see what happens when we apply this technique to different types of data - from Shakespeare to Wikipedia to Linux source code - and explore the surprising capabilities that emerge.