
Module 4: Character-Level Language Modeling

The core idea behind modern AI: predict the next token

This module covers the heart of Karpathy's blog post: training an RNN to predict the next character in a sequence. This deceptively simple task - when scaled up - is the foundation of ChatGPT, Claude, and every modern language model.

We'll walk through the complete pipeline: encoding characters as vectors, training with cross-entropy loss, and sampling with temperature control. By the end, you'll understand exactly what happens when you adjust the "temperature" slider in your favorite AI tool.

🎯 The Core Idea

Next-Character Prediction

Learn language by predicting what comes next

Explain to Your Stakeholders

For Your Manager

Character-level language models are the conceptual ancestor of ChatGPT and Claude. The core idea - predict the next token given context - scales from single characters to the sophisticated AI assistants your company uses today. Understanding this helps you grasp why AI can be both remarkably capable and surprisingly limited.

The "hello" Example

Consider training on the single word "hello". At each position, the model must predict the next character:

Input:"h"Target:"e"
Input:"he"Target:"l"
Input:"hel"Target:"l"
Input:"hell"Target:"o"

The model learns: after "h", "e" is likely. After "hel", "l" is likely. After "hell", "o" is likely. Scale this to millions of words, and the model learns language.
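
In plain Python, these training pairs can be enumerated in one line. This is a small illustrative sketch; the variable names are ours, not from the original post.

# Enumerate (context, next-character) training pairs for "hello"
text = "hello"
pairs = [(text[:i], text[i]) for i in range(1, len(text))]
print(pairs)
# [('h', 'e'), ('he', 'l'), ('hel', 'l'), ('hell', 'o')]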

Step 1: One-Hot Encoding

Converting characters to vectors the network can process

Neural networks work with numbers, not characters. We convert each character to a one-hot vector: a vector of zeros with a single 1 at the character's index.

Example: Vocabulary = {h, e, l, o}
'h'  →  [1, 0, 0, 0]   (index 0)
'e'  →  [0, 1, 0, 0]   (index 1)
'l'  →  [0, 0, 1, 0]   (index 2)
'o'  →  [0, 0, 0, 1]   (index 3)

Why One-Hot?

One-hot encoding treats each character as equally different from every other character. 'a' isn't "closer" to 'b' than to 'z' - they're all orthogonal vectors. The network learns meaningful relationships during training.

For a vocabulary of V characters, each input is a V-dimensional vector. Real character-level models might have V ≈ 100 (letters, digits, punctuation). Byte-level models use V = 256.
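
A minimal NumPy sketch of this encoding for the toy vocabulary above (the helper names here are illustrative, not from the original post):

import numpy as np

# Map each character to an index, then to a one-hot vector
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

print(one_hot('l'))   # [0. 0. 1. 0.]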

Step 2: Cross-Entropy Loss

Measuring how wrong the model's predictions are

At each timestep, the model outputs a probability distribution over all possible next characters. We measure the quality of these predictions using cross-entropy loss:

Cross-Entropy Loss

L = -\sum_t \log p_t(x_t)

  • L: total loss (cross-entropy). What we minimize during training; lower is better.
  • p_t(x_t): predicted probability of the correct next character. The model's confidence in its prediction.
  • x_t: the actual character at position t in the training sequence.
  • log: natural logarithm. Converts probabilities to log-probabilities for numerical stability.

We sum the negative log-probability of the correct next character at each position. Lower loss = better predictions.

Concrete Example

After seeing "hel", suppose the model predicts:

p('l') = 0.7   ← correct!
p('o') = 0.1
p('e') = 0.1
p('h') = 0.1

Loss for this timestep: -log(0.7) ≈ 0.357

If the model had predicted p('l') = 0.99, the loss would be -log(0.99) ≈ 0.01 (much better!)
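
You can check these numbers directly in Python:

import math

print(-math.log(0.7))    # ~0.357: loss when the correct character gets probability 0.7
print(-math.log(0.99))   # ~0.010: loss when the model is confidently correct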

Perplexity: A Human-Readable Metric

Perplexity = \exp(L / T), the exponentiated average per-character loss. It represents "how many characters the model is confused between on average." A perplexity of 5 means the model is as uncertain as if it were choosing uniformly at random among 5 characters. Good character-level models achieve a perplexity of around 1.5-2.
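
As a quick illustration, with an assumed average loss of 0.6 per character:

import math

avg_loss = 0.6              # assumed average cross-entropy per character
print(math.exp(avg_loss))   # ~1.82: about as uncertain as choosing among ~1.8 characters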

🌡️ Key Concept

Temperature-Scaled Sampling

Controlling creativity vs. consistency in generation

Explain to Your Stakeholders

For Your Manager

When you adjust the temperature slider in ChatGPT or Claude, you are controlling this exact parameter. Low temperature (0.1-0.3) gives focused, deterministic outputs good for factual tasks. High temperature (0.7-1.0) produces more creative, varied responses - useful for brainstorming but with higher error risk.

The Temperature Equation

Temperature-Scaled Softmax

p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}

  • p_i: probability of character i. The output probability after temperature scaling.
  • z_i: logit (raw score) for character i. The unnormalized output from the network.
  • τ: temperature parameter. Controls randomness: low = more deterministic, high = more random.
  • exp(z_i / τ): exponentiated scaled logit. The softmax numerator with temperature scaling.

Dividing logits by temperature τ before softmax controls the "sharpness" of the distribution.

Temperature in Action

Given logits [2.0, 1.0, 0.5, 0.1] for characters ['e', 'a', 'i', 'o']:

τ = 0.5 (Focused)
  • 'e': 83%
  • 'a': 11%
  • 'i': 4%
  • 'o': 2%

Strongly favors 'e'

τ = 1.0 (Normal)
  • 'e': 57%
  • 'a': 21%
  • 'i': 13%
  • 'o': 9%

The distribution as trained

τ = 2.0 (Creative)
  • 'e': 41%
  • 'a': 25%
  • 'i': 19%
  • 'o': 16%

More willing to try alternatives
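
A short NumPy sketch (the function name is ours) reproduces the numbers above:

import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    z = np.asarray(logits) / tau     # scale logits by temperature
    z = z - z.max()                  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5, 0.1]        # scores for 'e', 'a', 'i', 'o'
for tau in (0.5, 1.0, 2.0):
    print(tau, np.round(softmax_with_temperature(logits, tau), 2))
# 0.5 [0.83 0.11 0.04 0.02]
# 1.0 [0.57 0.21 0.13 0.09]
# 2.0 [0.41 0.25 0.19 0.16]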

Connection to Modern AI

When you set the temperature in ChatGPT, Claude, or other AI assistants, you're adjusting this exact parameter. The underlying principle hasn't changed since Karpathy's character-level RNNs; it has simply been scaled up to models with billions of parameters, trained on trillions of tokens instead of characters.

The Training Process

How the model learns from data

1
Feed a sequence

Input characters one at a time, updating hidden state at each step

2
Predict next character

At each timestep, output probability distribution over vocabulary

3
Compute loss

Cross-entropy between predictions and actual next characters

4
Backpropagate through time

Compute gradients and update weights (W_hh, W_xh, W_hy)

5
Repeat millions of times

Process different sequences, gradually improving predictions (steps 1-3 are sketched in code below)
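
To make steps 1-3 concrete, here is a minimal NumPy sketch of the forward pass and loss for one training sequence, in the spirit of Karpathy's min-char-rnn but heavily simplified: the weight names follow the list above, the sizes and data are illustrative, and the BPTT update (step 4) is omitted for brevity.

import numpy as np

vocab = sorted(set("hello"))                    # ['e', 'h', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
V, H = len(vocab), 8                            # vocabulary size, hidden size

rng = np.random.default_rng(0)
W_xh = rng.normal(0, 0.01, (H, V))              # input  -> hidden
W_hh = rng.normal(0, 0.01, (H, H))              # hidden -> hidden
W_hy = rng.normal(0, 0.01, (V, H))              # hidden -> output logits
b_h, b_y = np.zeros(H), np.zeros(V)

def sequence_loss(inputs, targets):
    h = np.zeros(H)
    loss = 0.0
    for x_ix, y_ix in zip(inputs, targets):
        x = np.zeros(V); x[x_ix] = 1.0                     # step 1: one-hot input
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)             # update hidden state
        logits = W_hy @ h + b_y
        p = np.exp(logits - logits.max()); p /= p.sum()    # step 2: distribution over next char
        loss += -np.log(p[y_ix])                           # step 3: cross-entropy
    return loss

ixs = [char_to_ix[ch] for ch in "hello"]
print(sequence_loss(ixs[:-1], ixs[1:]))         # inputs "hell", targets "ello"
# Step 4 (backpropagation through time) and step 5 (looping over data) are omitted here.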

What the Model Learns

Early Training
  • Common letter frequencies
  • Basic letter combinations ("th", "qu")
  • Word boundaries (spaces)

Later Training
  • Complete words and spelling
  • Grammar and sentence structure
  • Style and long-range patterns

Generating Text

Sampling from the trained model

Once trained, we can generate text by sampling from the model's predictions:

# Generation loop: repeatedly predict, sample, and append
seed = "The "
for i in range(100):
    probabilities = model(seed)                         # distribution over the next character
    next_char = sample(probabilities, temperature=τ)    # temperature-scaled sampling
    seed = seed + next_char

Sample Outputs by Temperature

Same Shakespeare-trained model, different temperatures:

τ = 0.5 (Conservative)

"The king is the more the state of the state of the people..."

Coherent but repetitive

τ = 1.0 (Balanced)

"The king doth wake to-night and takes his rouse, keeps wassail..."

Good variety and coherence

τ = 1.5 (Creative)

"The kinghBrol'd-Loss?ump thee, veck'd shalg remond..."

Creative but less coherent

💡 Key Insight

The Unreasonable Power of Next-Token Prediction

The remarkable thing about character-level modeling is how much emerges from such a simple objective. The model isn't explicitly taught spelling, grammar, or style - it learns all of these as a byproduct of predicting the next character well.

What We Teach

"Given these characters, predict the next one"

What It Learns
  • Spelling and vocabulary
  • Grammar and syntax
  • Style and voice
  • Long-range coherence

This same principle - scaled from characters to tokens, from megabytes to terabytes, from single GPUs to massive clusters - is what powers today's large language models. The foundation is exactly what you've learned in this module.

Summary

Key Equations
  • Loss: L = -\sum_t \log p_t(x_t)
  • Temperature: p_i = \exp(z_i / \tau) / \sum_j \exp(z_j / \tau)
  • Perplexity: \exp(L / T) (the exponentiated average per-character loss)
Key Concepts
  • One-hot encoding for characters
  • Cross-entropy measures prediction quality
  • Temperature controls creativity vs. consistency
  • Autoregressive generation (feed output back as input)

In the next module, we'll see what happens when we apply this technique to different types of data - from Shakespeare to Wikipedia to Linux source code - and explore the surprising capabilities that emerge.