Module 6: Beyond Text - RNNs in Vision, Speech, and Translation
How RNNs became the universal glue connecting different AI capabilities
Until now, we have focused on RNNs for text. But the real breakthrough came when researchers realized RNNs could be the interface between different modalities. See an image? Describe it with an RNN. Hear speech? Transcribe it with an RNN. Translate between languages? Encode with one RNN, decode with another.
This module covers the encoder-decoder architecture and its applications to image captioning, machine translation, and speech recognition. These innovations from 2014-2016 represented the state of the art before Transformers - and the underlying principles still inform modern multimodal AI.
The Encoder-Decoder Paradigm
One architecture to transform them all
The Two-Stage Process
Stage 1: Encode
| Color | Element | Meaning |
|---|---|---|
| Blue | Encoder hidden state | The compressed representation of the entire input sequence |
| Green | Encoder RNN | Processes the input sequence and builds up context |
| Orange | Input sequence | The source sequence (e.g., a French sentence, image features) |
The encoder reads the entire input sequence and compresses it into a fixed-size context vector - the final hidden state.
Stage 2: Decode
| Color | Element | Meaning |
|---|---|---|
| Red | Output at time t | The generated token at each timestep |
| Green | Decoder RNN | Generates the output sequence conditioned on the encoder state |
| Blue | Encoder hidden state | Context from the encoder, typically used to initialize the decoder |
| Purple | Previous outputs | Previously generated tokens (autoregressive feedback) |
The decoder generates output one token at a time, conditioned on the context vector and its previous outputs.
Information Flow
The context vector is the information bottleneck - it must capture everything needed to generate the output.
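To make the two stages concrete, here is a minimal PyTorch sketch (GRU encoder and decoder; class name and dimensions are illustrative assumptions, not a specific published system). The encoder's final hidden state is the fixed-size context vector that initializes the decoder:

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: the encoder's final hidden state
    becomes the fixed-size context vector that initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        # Stage 1: encode -- compress the whole input into one vector.
        _, context = self.encoder(self.src_emb(src))      # (1, batch, hidden)
        # Stage 2: decode -- generate conditioned on that context vector.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)                          # (batch, len, tgt_vocab)
```

During training, `tgt` would be the shifted target sequence (teacher forcing); at inference, the decoder's own output is fed back one token at a time.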
The Bottleneck Problem
Compressing an entire sentence (or image) into a single vector works for short sequences but degrades for longer ones. This limitation motivated attention mechanisms, which we will cover in the next module.
Image Captioning: CNN + RNN
Teaching AI to describe what it sees
The Architecture
Image captioning combines two powerful architectures: a CNN for visual understanding and an RNN for language generation.
1. Extract features: pass the image through a pre-trained CNN (VGG, ResNet) and take the output of the penultimate layer.
2. Condition the RNN: transform the features to initialize the hidden state, or feed them as input at each timestep.
3. Generate: the RNN predicts one word at a time, feeding each output back as the next input.
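These three steps can be sketched in PyTorch using a torchvision ResNet-18 as the feature extractor. This is a hedged, Show-and-Tell-style illustration; the layer sizes and names are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Sketch of CNN + RNN captioning: CNN features initialize the LSTM
    state, then the LSTM generates the caption one word at a time."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights="DEFAULT")
        # Drop the classification layer to expose penultimate features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.img_proj = nn.Linear(512, hidden_dim)    # 512 = resnet18 feature dim
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.cnn(images).flatten(1)           # (batch, 512)
        h0 = torch.tanh(self.img_proj(feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)                     # start cell state at zero
        rnn_out, _ = self.rnn(self.emb(captions), (h0, c0))
        return self.out(rnn_out)                      # next-word logits
```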
Example Caption Generation
(Hypothetical image)
CNN extracts:
- Dog detector: 0.95
- Ball detector: 0.87
- Outdoor scene: 0.92
- Grass texture: 0.78
RNN generates:
"A dog is playing with a ball in the park"
Historical Impact
The 2014 paper "Show and Tell" (Vinyals et al.) achieved BLEU scores that were considered impossible just years earlier. This proved that deep learning could bridge perception and language - a key step toward multimodal AI systems like GPT-4V and Gemini.
Machine Translation
The application that revolutionized an industry
Encoder-Decoder for Translation
The context vector captures the meaning of the French sentence, allowing the decoder to generate the English translation word by word.
Key Innovations
Reverse Source Sequence
Surprisingly, reversing the source sentence improved results significantly. "Je suis étudiant" becomes "étudiant suis Je". This puts the first words of source and target closer together in the computation graph.
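In code, the trick is a one-line reversal of the source token list before encoding (the tokens here are purely illustrative):

```python
src_tokens = ["Je", "suis", "étudiant"]
reversed_src = src_tokens[::-1]   # ['étudiant', 'suis', 'Je']
# The first source word now sits closest to the first decoder step,
# shortening the path gradients must travel between aligned words.
```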
Bidirectional Encoding
Run two RNNs - one forward, one backward - and concatenate their hidden states. This gives each position context from both past and future.
Deep Stacking
Stack multiple LSTM layers (4-8 layers typical). Each layer operates on the hidden states of the previous layer, building increasingly abstract representations.
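Both of these innovations are single constructor arguments in a modern framework. A sketch with PyTorch's `nn.LSTM` (the dimensions are illustrative assumptions):

```python
import torch.nn as nn

# Bidirectional + stacked encoder in one module: 4 layers, 512 units per
# direction. Every position's output is the concatenation of the forward
# and backward hidden states, so the feature size doubles to 1024.
encoder = nn.LSTM(
    input_size=256,
    hidden_size=512,
    num_layers=4,
    bidirectional=True,
    batch_first=True,
)
```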
Beam Search
Instead of greedily picking the highest-probability word, maintain the top-k candidates and explore multiple translation paths in parallel.
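A framework-agnostic sketch of the idea. Here `step_fn` is a hypothetical callback that scores candidate next tokens for a given prefix; in a real system it would run the decoder RNN one step:

```python
def beam_search(step_fn, start_token, end_token, beam_width=4, max_len=20):
    """Generic beam search sketch. `step_fn(prefix)` is assumed to return
    a list of (token, log_prob) pairs for the next position."""
    beams = [([start_token], 0.0)]            # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        # Keep only the top-k partial translations at each step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```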
The 2016 Revolution
When Google switched to Neural Machine Translation in 2016, users noticed immediately. The quality improvement was so dramatic that some initially thought it was a bug. GNMT reduced translation errors by 55-85% on several major language pairs compared to the previous phrase-based system.
Speech Recognition
From sound waves to transcribed text
The Speech Pipeline
1. Feature extraction: convert the raw audio waveform to mel-frequency spectrograms, 2D images of sound over time.
2. Acoustic modeling: process the spectrograms with stacked bidirectional LSTMs to extract acoustic features.
3. Alignment: use Connectionist Temporal Classification (CTC) to align variable-length audio to text without explicit segmentation.
4. Language-model decoding: combine the acoustic model's output with a language model to improve fluency and correct errors.
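Step 1 can be sketched with librosa (the filename and parameter values below are illustrative assumptions):

```python
import librosa

# Hypothetical input file; 16 kHz is a common rate for speech models.
waveform, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)
log_mel = librosa.power_to_db(mel)   # log-compress amplitudes
print(log_mel.shape)                 # (80, n_frames): mel bins x time frames
```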
The Alignment Challenge
A key challenge in speech recognition: there is no one-to-one correspondence between audio frames and characters. The same word spoken at different speeds produces different numbers of frames.
CTC introduces a blank symbol (-) to handle variable-length alignment; decoding collapses repeated characters and then removes the blanks.
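A minimal sketch of that collapse rule, applied greedily to an already-predicted per-frame labeling:

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Collapse a per-frame CTC labeling: merge repeats, then drop blanks."""
    out, prev = [], None
    for ch in frame_labels:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# Repeats within a run merge; the blank keeps the double "l" distinct.
print(ctc_greedy_decode("hheel-lloo"))   # -> "hello"
```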
Visual Question Answering (VQA)
Understanding images through natural language questions
VQA combines everything: visual understanding from CNNs, language understanding from RNNs, and reasoning to produce answers. Given an image and a question, the model must understand both and generate an appropriate response.
Example VQA Interaction
(Beach scene image)
Q: What are the people doing?
A: Playing on the beach
Q: How many people are there?
A: Four
Q: Is it sunny?
A: Yes
Architecture Overview
The visual and question features are combined (often via element-wise multiplication or concatenation) and passed through a classifier to predict the answer.
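A hedged sketch of such a fusion head in PyTorch; the feature dimensions and answer-vocabulary size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VQAFusion(nn.Module):
    """Simple VQA head: fuse image and question features, classify answer."""
    def __init__(self, img_dim=2048, q_dim=512, hidden=1024, n_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, img_feat, q_feat):
        # Element-wise multiplication is one common fusion choice;
        # concatenating into a wider layer is another.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(fused)
```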
RNNs as the Universal Glue
The remarkable insight from this era (2014-2017) was that RNNs could serve as a universal interface between different types of data. Any input that could be encoded as a sequence could be processed by an RNN, and any output that could be generated sequentially could be produced by an RNN.
Input: Anything
- Images (via CNN)
- Audio (via spectrograms)
- Text (via embeddings)
- Video (frame by frame)
Process: RNN
- Encode to context vector
- Maintain hidden state
- Decode autoregressively
- End-to-end training
Output: Anything
- Text (captions, translations)
- Labels (classification)
- Speech (text-to-speech)
- Actions (reinforcement learning)
This universality was both RNNs' greatest strength and their limitation. While they could theoretically handle any sequence task, the bottleneck of compressing information through fixed-size hidden states limited their practical performance on long sequences. The attention mechanism, which we will cover next, addressed this limitation and eventually led to the Transformer architecture that dominates today.
Summary
Key Equations
- Encoder: $h_t = f(h_{t-1}, x_t)$, with context vector $c = h_T$
- Decoder: $s_t = g(s_{t-1}, y_{t-1}, c)$, $P(y_t \mid y_{<t}, c) = \mathrm{softmax}(W_o s_t)$
- Image features: $h_0 = W_v \cdot \mathrm{CNN}(I)$
Key Applications
- Image Captioning: CNN + RNN generates descriptions
- Translation: Encoder-decoder across languages
- Speech: Bidirectional LSTM with CTC
- VQA: Multimodal fusion for QA
Key Concepts
- Encoder-decoder: Compress input, then generate output
- Context vector: Fixed-size representation of the entire input (the bottleneck)
- Autoregressive generation: Output one token at a time, feeding back as input
- Multimodal fusion: Combining features from different modalities
In the next module, we will see how attention mechanisms solved the bottleneck problem and revolutionized sequence modeling - eventually leading to the Transformer architecture that powers modern AI.
Test Your Knowledge
Check your understanding of encoder-decoder architectures and multimodal applications