Implementation Deep Dive
From NumPy to PyTorch to Hugging Face
From Theory to Code
Three paths from understanding to implementation
You've learned the theory. Now it's time to build. This module offers three implementation tracks, each suited to different goals: deep understanding, production deployment, or rapid prototyping.
Choose your track based on what you want to learn and how much time you have. Many practitioners go through all three: NumPy for understanding, PyTorch for custom solutions, and Hugging Face for quick experiments.
Choose Your Implementation Track
Track B: PyTorch Production
Production-ready implementation with LSTM, GPU training, and proper validation.
Advantages
- GPU acceleration
- Automatic differentiation
- Production-ready patterns
Trade-offs
- Some framework abstraction
- Need to understand PyTorch
- More boilerplate
Complete Implementation
"""
Production-ready character-level LSTM in PyTorch
"""
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
class CharDataset(Dataset):
def __init__(self, text, seq_length):
self.text = text
self.seq_length = seq_length
self.chars = sorted(list(set(text)))
self.char_to_idx = {c: i for i, c in enumerate(self.chars)}
self.idx_to_char = {i: c for i, c in enumerate(self.chars)}
self.vocab_size = len(self.chars)
def __len__(self):
return len(self.text) - self.seq_length
def __getitem__(self, idx):
chunk = self.text[idx:idx + self.seq_length + 1]
input_seq = torch.tensor([self.char_to_idx[c] for c in chunk[:-1]])
target_seq = torch.tensor([self.char_to_idx[c] for c in chunk[1:]])
return input_seq, target_seq
class CharLSTM(nn.Module):
def __init__(self, vocab_size, embed_size, hidden_size, num_layers, dropout=0.5):
super().__init__()
self.hidden_size = hidden_size
self.num_layers = num_layers
self.embedding = nn.Embedding(vocab_size, embed_size)
self.lstm = nn.LSTM(
embed_size, hidden_size, num_layers,
batch_first=True, dropout=dropout if num_layers > 1 else 0
)
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(hidden_size, vocab_size)
def forward(self, x, hidden=None):
embed = self.embedding(x)
output, hidden = self.lstm(embed, hidden)
output = self.dropout(output)
logits = self.fc(output)
return logits, hidden
def init_hidden(self, batch_size, device):
h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
return (h0, c0)
def train_epoch(model, dataloader, criterion, optimizer, device, clip=1.0):
model.train()
total_loss = 0
for batch_idx, (inputs, targets) in enumerate(dataloader):
inputs, targets = inputs.to(device), targets.to(device)
batch_size = inputs.size(0)
hidden = model.init_hidden(batch_size, device)
optimizer.zero_grad()
outputs, _ = model(inputs, hidden)
loss = criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), clip) # Gradient clipping!
optimizer.step()
total_loss += loss.item()
return total_loss / len(dataloader)
def generate(model, dataset, seed_text, length=200, temperature=0.8, device='cuda'):
model.eval()
chars = [dataset.char_to_idx[c] for c in seed_text]
hidden = model.init_hidden(1, device)
# Process seed
for char_idx in chars[:-1]:
x = torch.tensor([[char_idx]]).to(device)
_, hidden = model(x, hidden)
# Generate
generated = list(seed_text)
x = torch.tensor([[chars[-1]]]).to(device)
for _ in range(length):
logits, hidden = model(x, hidden)
probs = torch.softmax(logits[0, 0] / temperature, dim=0)
char_idx = torch.multinomial(probs, 1).item()
generated.append(dataset.idx_to_char[char_idx])
x = torch.tensor([[char_idx]]).to(device)
return ''.join(generated)
# Training script
if __name__ == '__main__':
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Hyperparameters
EMBED_SIZE = 128
HIDDEN_SIZE = 512
NUM_LAYERS = 2
DROPOUT = 0.5
SEQ_LENGTH = 100
BATCH_SIZE = 64
LEARNING_RATE = 0.002
EPOCHS = 50
# Load data
with open('shakespeare.txt', 'r') as f:
text = f.read()
dataset = CharDataset(text, SEQ_LENGTH)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
model = CharLSTM(
dataset.vocab_size, EMBED_SIZE, HIDDEN_SIZE, NUM_LAYERS, DROPOUT
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
for epoch in range(EPOCHS):
loss = train_epoch(model, dataloader, criterion, optimizer, device)
scheduler.step(loss)
print(f'Epoch {epoch+1}, Loss: {loss:.4f}')
if (epoch + 1) % 5 == 0:
sample = generate(model, dataset, 'ROMEO:', length=200)
print(f'\nSample:\n{sample}\n')Backpropagation Through Time (BPTT)
How gradients flow backward through sequences
Training RNNs requires computing gradients across all timesteps. Unrolling the network and applying the chain rule produces a product of Jacobians that either vanishes or explodes:

$$\frac{\partial L}{\partial W} \;=\; \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_k}{\partial W}$$

| Symbol | Meaning |
|---|---|
| $\partial L / \partial W$ | Gradient of the loss w.r.t. the weights: what we need to compute for learning |
| $\partial L_t / \partial h_t$ | Gradient at timestep $t$: the error signal at each position in the sequence |
| $\partial h_t / \partial h_{t-1}$ | Jacobian between timesteps: how the hidden state at $t$ depends on the state at $t-1$ |

The critical factor is the product of Jacobians $\prod_i \partial h_i / \partial h_{i-1}$, which links distant timesteps.
Exploding Gradients
When $\left\lVert \partial h_t / \partial h_{t-1} \right\rVert > 1$ across many timesteps, the product grows exponentially.
Solution: Gradient clipping
Vanishing Gradients
When $\left\lVert \partial h_t / \partial h_{t-1} \right\rVert < 1$ across many timesteps, the product shrinks toward zero.
Solution: LSTM/GRU gating
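To see the effect numerically, here is a small illustration (not from the original module) that multiplies a chain of identical Jacobians and tracks the norm of the product; a scale slightly below or above 1 is enough to wipe out or blow up the gradient signal over 100 timesteps.

# Hypothetical illustration: behavior of a product of T identical Jacobians
import torch

T = 100
for scale in (0.9, 1.1):               # spectral scale below vs. above 1
    J = scale * torch.eye(8)           # stand-in for dh_t/dh_{t-1}
    product = torch.eye(8)
    for _ in range(T):
        product = product @ J          # accumulate the Jacobian chain
    print(f'scale={scale}: ||product|| = {product.norm().item():.3e}')
# scale=0.9 collapses toward zero (vanishing); scale=1.1 grows to ~4e+4 (exploding)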
Truncated BPTT
In practice, we don't backpropagate through the entire sequence. Instead, we split the sequence into chunks of a fixed length $k$ (typically 25-100) and only propagate gradients within each chunk. This trades some gradient accuracy for computational efficiency and memory savings.
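As a concrete sketch (not part of the training script above), truncated BPTT for a single long sequence can be written by carrying the hidden state across chunks and calling detach() at each chunk boundary so the graph, and therefore the gradient, stops there. The chunk length k and the data layout are assumptions.

# Truncated-BPTT sketch; assumes `model`, `criterion`, `optimizer` as defined above
# and data/targets as LongTensors of shape (1, total_len) for one long sequence.
def train_truncated_bptt(model, data, targets, criterion, optimizer, device, k=50):
    model.train()
    hidden = model.init_hidden(1, device)
    total_loss = 0.0
    for start in range(0, data.size(1), k):
        x = data[:, start:start + k].to(device)
        y = targets[:, start:start + k].to(device)
        # Detach: keep the state values, but cut the graph at the chunk boundary
        hidden = tuple(h.detach() for h in hidden)
        optimizer.zero_grad()
        logits, hidden = model(x, hidden)
        loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        total_loss += loss.item()
    return total_loss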
Training Tips & Best Practices
Techniques that make RNN training actually work
Gradient Clipping
Prevent exploding gradients by capping gradient norms
# PyTorch
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# TensorFlow
optimizer = tf.keras.optimizers.Adam(clipnorm=1.0)
Why: RNNs multiply gradients across time steps. Without clipping, gradients can explode to infinity, causing NaN losses.
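A useful detail: clip_grad_norm_ returns the total gradient norm measured before clipping, so you can log it and see how often clipping actually engages (the print below is just a sketch inside the training loop).

# Log the pre-clip gradient norm; values frequently above max_norm mean
# clipping is doing real work on this model.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
if batch_idx % 100 == 0:
    print(f'batch {batch_idx}: grad norm before clipping = {float(total_norm):.2f}')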
Adam Optimizer
Adaptive learning rates per parameter
# Recommended settings for RNNs
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,             # Start here
    betas=(0.9, 0.999),
    eps=1e-8
)

Why: Adam adapts learning rates based on gradient history, handling the varying scales of RNN gradients better than SGD.
Dropout
Regularization to prevent overfitting
# Apply to non-recurrent connections
self.lstm = nn.LSTM(
    input_size, hidden_size,
    dropout=0.5,          # Between LSTM layers
    num_layers=2
)
self.dropout = nn.Dropout(0.5)  # After LSTM

Why: RNNs easily overfit to training sequences. Dropout randomly zeros activations during training, forcing redundant representations.
Learning Rate Schedule
Reduce learning rate when loss plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',
    factor=0.5,
    patience=3
)

# After each epoch:
scheduler.step(val_loss)

Why: Starting with a higher LR finds good regions quickly; reducing it allows fine-tuning without overshooting.
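The schedule above steps on val_loss, which the training script earlier never computes; a minimal validation pass might look like the sketch below, where the held-out val_loader split is an assumption.

# Hypothetical validation pass; assumes a held-out DataLoader named `val_loader`.
@torch.no_grad()
def evaluate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0.0
    for inputs, targets in val_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        hidden = model.init_hidden(inputs.size(0), device)
        logits, _ = model(inputs, hidden)
        total_loss += criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1)).item()
    return total_loss / len(val_loader)

# Then, per epoch:
# val_loss = evaluate(model, val_loader, criterion, device)
# scheduler.step(val_loss)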
Troubleshooting: Common Failure Modes
When things go wrong and how to fix them
| Symptom | Likely Causes | Solutions | Severity |
|---|---|---|---|
| Loss is NaN | Exploding gradients; learning rate too high | Add or verify gradient clipping; lower the learning rate | Critical |
| Loss stuck high | Learning rate too low; model too small; input/target pipeline bug | Raise the learning rate; increase hidden size or layers; sanity-check the data encoding | High |
| Generates gibberish | Model undertrained; sampling temperature too high | Train longer; lower the temperature | Medium |
| Repeats same phrase | Temperature too low or greedy decoding; overfitting | Raise the temperature; sample instead of taking the argmax; add dropout | Medium |
| Training very slow | Running on CPU; batch size too small | Move model and data to the GPU; increase batch size | Low |
| Out of memory | Batch size, sequence length, or hidden size too large | Reduce batch size, sequence length, or model size | High |
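For the "Loss is NaN" row in particular, a cheap guard inside the training loop catches the first bad batch before the weights are poisoned; this is a hedged sketch, not part of the script above.

# Stop at the first non-finite loss so the offending batch can be inspected.
if not torch.isfinite(loss):
    raise RuntimeError(
        f'Non-finite loss at batch {batch_idx}: {loss.item()} '
        '(check learning rate, gradient clipping, and input encoding)'
    )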
Implementation Checklist
Start with the simplest track that meets your needs (usually Track C)
Always use gradient clipping - exploding gradients will ruin your training
Use LSTM/GRU over vanilla RNN unless you have a specific reason not to
Monitor both training and validation loss - overfitting is common
Start with proven hyperparameters, then tune incrementally
Save checkpoints frequently - training can be unstable (a minimal saving/loading sketch follows this checklist)
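A minimal checkpointing sketch, assuming the model, optimizer, epoch, and loss names from the training script above and an illustrative file name:

# Save enough state to resume training, not just the weights.
checkpoint = {
    'epoch': epoch,
    'model_state': model.state_dict(),
    'optimizer_state': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, f'checkpoint_epoch{epoch + 1}.pt')

# To resume later:
# state = torch.load(f'checkpoint_epoch{epoch + 1}.pt', map_location=device)
# model.load_state_dict(state['model_state'])
# optimizer.load_state_dict(state['optimizer_state'])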
Test Your Knowledge
Implementation Deep Dive - Knowledge Check
Test your understanding of RNN implementation, training techniques, and troubleshooting.