
Module 6: Beyond Text - RNNs in Vision, Speech, and Translation

How RNNs became the universal glue connecting different AI capabilities

Until now, we have focused on RNNs for text. But the real breakthrough came when researchers realized RNNs could be the interface between different modalities. See an image? Describe it with an RNN. Hear speech? Transcribe it with an RNN. Translate between languages? Encode with one RNN, decode with another.

This module covers the encoder-decoder architecture and its applications to image captioning, machine translation, and speech recognition. These innovations from 2014-2016 represented the state of the art before Transformers - and the underlying principles still inform modern multimodal AI.

🔄Core Architecture

The Encoder-Decoder Paradigm

One architecture to transform them all

Explain to Your Stakeholders

For Your Manager

The encoder-decoder architecture is like a skilled interpreter: first, fully understand the source (encoder), then produce the output in a new form (decoder). This same pattern powers Google Translate, image captioning, and speech-to-text - any task where you need to transform one sequence into another.

The Two-Stage Process

Stage 1: Encode
Encoder
Color legend:
  • blue - Encoder hidden state: the compressed representation of the entire input sequence
  • green - Encoder RNN: processes the input sequence and builds up context
  • orange - Input sequence: the source sequence (e.g., a French sentence, image features)

The encoder reads the entire input sequence and compresses it into a fixed-size context vector - the final hidden state.
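
A minimal sketch of the encoding stage, assuming PyTorch (the module does not prescribe a framework); the class name and dimensions are illustrative:

```python
# Minimal GRU encoder sketch: reads the whole input, returns its final hidden state.
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embed(src_tokens)      # (batch, src_len, embed_dim)
        _, final_hidden = self.rnn(embedded)   # (1, batch, hidden_dim)
        return final_hidden                    # the fixed-size context vector
```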

Stage 2: Decode
Decoder
Color legend:
  • red - Output at time t: the generated token at each timestep
  • green - Decoder RNN: generates the output sequence conditioned on the encoder state
  • blue - Encoder hidden state: context from the encoder, typically used to initialize the decoder
  • purple - Previous outputs: previously generated tokens (autoregressive)

The decoder generates output one token at a time, conditioned on the context vector and its previous outputs.
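
A matching decoder sketch under the same assumptions, using greedy decoding for simplicity (production systems typically use beam search, discussed later in this module):

```python
# Minimal GRU decoder sketch: starts from the encoder's context vector and
# feeds each predicted token back in as the next input (autoregressive).
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def generate(self, context, start_token, max_len=20):
        # context: (1, batch, hidden_dim) final hidden state from the encoder
        hidden = context
        token = start_token                              # (batch, 1) start-of-sequence ids
        outputs = []
        for _ in range(max_len):
            embedded = self.embed(token)                 # (batch, 1, embed_dim)
            rnn_out, hidden = self.rnn(embedded, hidden)
            logits = self.out(rnn_out[:, -1])            # (batch, vocab_size)
            token = logits.argmax(dim=-1, keepdim=True)  # greedy pick, fed back in
            outputs.append(token)
        return torch.cat(outputs, dim=1)                 # (batch, max_len) generated ids
```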

Information Flow

Input Sequence → Encoder RNN → Context Vector → Decoder RNN → Output Sequence

The context vector is the information bottleneck - it must capture everything needed to generate the output.

The Bottleneck Problem

Compressing an entire sentence (or image) into a single vector works for short sequences but degrades for longer ones. This limitation motivated attention mechanisms, which we will cover in the next module.

🖼️Application 1

Image Captioning: CNN + RNN

Teaching AI to describe what it sees

Explain to Your Stakeholders

For Your Manager

Image captioning was a watershed moment: it proved AI could bridge perception and language. This technology now powers alt-text generation for accessibility, photo organization in your phone, and content moderation at scale. It is the foundation of multimodal AI.

The Architecture

Image captioning combines two powerful architectures: a CNN for visual understanding and an RNN for language generation.

1
Extract visual features

Pass image through pre-trained CNN (VGG, ResNet). Take output of penultimate layer.

2
Initialize RNN with visual features

Transform features to initialize hidden state, or feed as input at each timestep.

3
Generate caption autoregressively

RNN predicts one word at a time, feeding output back as next input.
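
A compact sketch of how the three steps fit together, assuming PyTorch and torchvision (neither is prescribed here); class names and dimensions are illustrative, and teacher forcing is used during training:

```python
# CNN-to-RNN handoff for captioning: extract visual features, map them to the
# RNN's initial hidden state, then predict the caption word by word.
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        resnet = models.resnet18(weights=None)                    # use pre-trained weights in practice
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # keep everything up to the penultimate layer
        self.init_h = nn.Linear(512, hidden_dim)                  # visual features -> initial hidden state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, caption_tokens):
        feats = self.cnn(image).flatten(1)                        # (batch, 512) visual features
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)          # (1, batch, hidden_dim)
        embedded = self.embed(caption_tokens)                     # ground-truth words (teacher forcing)
        rnn_out, _ = self.rnn(embedded, h0)
        return self.out(rnn_out)                                  # per-timestep word logits
```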

Example Caption Generation
🐕🎾🌳

Hypothetical image

CNN extracts:

  • Dog detector: 0.95
  • Ball detector: 0.87
  • Outdoor scene: 0.92
  • Grass texture: 0.78

RNN generates:

"A dog is playing with a ball in the park"

Historical Impact

The 2014 paper "Show and Tell" (Vinyals et al.) achieved BLEU scores that were considered impossible just years earlier. This proved that deep learning could bridge perception and language - a key step toward multimodal AI systems like GPT-4V and Gemini.

🌐Application 2

Machine Translation

The application that revolutionized an industry

Explain to Your Stakeholders

For Your Manager

In 2016, Google switched from phrase-based statistical translation to neural machine translation, achieving the biggest single quality improvement in the history of the product. This architecture processed 100+ billion words daily and became the foundation for modern translation services.

Encoder-Decoder for Translation

French (encoder input): "Je" → "suis" → "étudiant" → context vector h
English (decoder output): h → "I" → "am" → "a" → "student"

The context vector captures the meaning of the French sentence, allowing the decoder to generate the English translation word by word.

Key Innovations

Reverse Source Sequence

Surprisingly, reversing the source sentence improved results significantly. "Je suis étudiant" becomes "étudiant suis Je". This puts the first words of the source and target closer together in the computation graph.
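
A tiny illustration of the trick (plain Python; the token list is illustrative):

```python
# Reverse the source tokens before feeding them to the encoder.
src = ["Je", "suis", "étudiant"]
reversed_src = list(reversed(src))   # ["étudiant", "suis", "Je"]
# "Je" is now the last token the encoder reads, so it sits right next to the
# decoder step that must produce "I" - a much shorter path through the unrolled graph.
```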

Bidirectional Encoding

Run two RNNs - one forward, one backward - and concatenate their hidden states. This gives each position context from both past and future.

Deep Stacking

Stack multiple LSTM layers (4-8 layers typical). Each layer operates on the hidden states of the previous layer, building increasingly abstract representations.
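
In PyTorch (assumed here), both bidirectional encoding and deep stacking are configuration options on the recurrent layer; the sizes below are illustrative:

```python
import torch.nn as nn

# Stacked bidirectional LSTM encoder: 4 layers, forward and backward passes.
encoder_rnn = nn.LSTM(
    input_size=256,       # embedding dimension
    hidden_size=512,
    num_layers=4,         # deep stacking: 4-8 layers were typical
    bidirectional=True,   # run forward and backward, concatenate the states
    batch_first=True,
)
# Each position's output is 2 * 512 = 1024 features: the forward and backward
# hidden states concatenated, giving context from both past and future.
```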

Beam Search

Instead of greedily picking the highest-probability word, maintain the top-k candidates and explore multiple translation paths in parallel.
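
A minimal beam search sketch over an abstract step function, assumed to return next-token log-probabilities (illustrative, not a library API):

```python
def beam_search(step, start_token, end_token, beam_width=5, max_len=30):
    """step(prefix) is assumed to return a dict {token: log_prob} for the next token."""
    beams = [([start_token], 0.0)]            # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, log_p in step(seq).items():
                candidates.append((seq + [token], score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:   # keep only the top-k hypotheses
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:                                # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])  # best complete (or partial) hypothesis
```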

The 2016 Revolution

When Google switched to Neural Machine Translation in 2016, users noticed immediately. The quality improvement was so dramatic that some initially thought it was a bug. GNMT reduced translation errors by 55-85% compared to the previous phrase-based system.

🎤Application 3

Speech Recognition

From sound waves to transcribed text

Explain to Your Stakeholders

For Your Manager

Speech recognition transformed how we interact with devices. Siri, Alexa, and Google Assistant all rely on RNN-based acoustic models. The technology has achieved human-parity on benchmark tasks and processes billions of voice queries daily.

The Speech Pipeline

1
Audio to Spectrogram

Convert raw audio waveform to mel-frequency spectrograms - 2D images of sound over time.

2
Bidirectional LSTM

Process spectrograms with stacked bidirectional LSTMs to extract acoustic features.

3
CTC Decoding

Use Connectionist Temporal Classification to align variable-length audio to text without explicit segmentation.

4
Language Model Fusion

Combine acoustic model output with language model to improve fluency and correct errors.
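
A sketch of steps 1 and 2, assuming torchaudio and PyTorch (the module does not name libraries); the sample rate, layer sizes, and character vocabulary are illustrative, and CTC decoding plus language-model fusion are left out:

```python
import torch
import torch.nn as nn
import torchaudio

# Step 1: waveform -> mel-frequency spectrogram (a 2D "image" of sound over time).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)

# Step 2: stacked bidirectional LSTM acoustic model over the spectrogram frames.
acoustic_rnn = nn.LSTM(input_size=80, hidden_size=512, num_layers=3,
                       bidirectional=True, batch_first=True)
vocab_size = 29                              # e.g. 26 letters + space + apostrophe + CTC blank
to_chars = nn.Linear(2 * 512, vocab_size)    # per-frame character logits, fed to a CTC loss

waveform = torch.randn(1, 16000)             # one second of placeholder audio
spec = mel(waveform).transpose(1, 2)         # (batch, time_frames, n_mels)
features, _ = acoustic_rnn(spec)             # (batch, time_frames, 2 * 512)
logits = to_chars(features)                  # (batch, time_frames, vocab_size)
```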

The Alignment Challenge

A key challenge in speech recognition: there is no one-to-one correspondence between audio frames and characters. The same word spoken at different speeds produces different numbers of frames.

Audio frames: [frame1] [frame2] [frame3] [frame4] [frame5] [frame6] [frame7]
CTC output:    H        -        e        l        -        l        o
Collapsed:    "Hello"

CTC introduces a blank symbol (-) to handle variable-length alignment; decoding collapses repeated characters and then removes the blanks.
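
A small helper that applies this collapse rule to the example above (plain Python, illustrative):

```python
def ctc_collapse(frames, blank="-"):
    """Merge consecutive repeats, then drop the blank symbol."""
    collapsed = []
    prev = None
    for symbol in frames:
        if symbol != prev:               # keep only the first of each repeated run
            collapsed.append(symbol)
        prev = symbol
    return "".join(s for s in collapsed if s != blank)

print(ctc_collapse(["H", "-", "e", "l", "-", "l", "o"]))  # -> "Hello"
```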

❓Application 4

Visual Question Answering (VQA)

Understanding images through natural language questions

VQA combines everything: visual understanding from CNNs, language understanding from RNNs, and reasoning to produce answers. Given an image and a question, the model must understand both and generate an appropriate response.

Example VQA Interaction
🏖️👨‍👩‍👧‍👦⛱️

Beach scene

Q: What are the people doing?

A: Playing on the beach

Q: How many people are there?

A: Four

Q: Is it sunny?

A: Yes

Architecture Overview
Image → CNN → Visual Features (v)
Question → LSTM → Question Features (q)
v ⊕ q → MLP → Answer

The visual and question features are combined (often via element-wise multiplication or concatenation) and passed through a classifier to predict the answer.
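
A minimal sketch of that late-fusion classifier, assuming PyTorch; the feature dimension and answer vocabulary size are illustrative:

```python
import torch.nn as nn

class VQAFusion(nn.Module):
    def __init__(self, feat_dim=512, num_answers=1000):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, v, q):
        # v: (batch, feat_dim) image features from the CNN
        # q: (batch, feat_dim) question features from the LSTM
        fused = v * q                    # element-wise product; concatenation is also common
        return self.classifier(fused)    # scores over a fixed answer vocabulary
```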

🧩Key Insight

RNNs as the Universal Glue

The remarkable insight from this era (2014-2017) was that RNNs could serve as a universal interface between different types of data. Any input that could be encoded as a sequence could be processed by an RNN, and any output that could be generated sequentially could be produced by an RNN.

Input: Anything
  • Images (via CNN)
  • Audio (via spectrograms)
  • Text (via embeddings)
  • Video (frame by frame)
Process: RNN
  • Encode to context vector
  • Maintain hidden state
  • Decode autoregressively
  • End-to-end training
Output: Anything
  • Text (captions, translations)
  • Labels (classification)
  • Speech (text-to-speech)
  • Actions (reinforcement learning)

This universality was both RNNs' greatest strength and their limitation. While they could theoretically handle any sequence task, the bottleneck of compressing information through fixed-size hidden states limited their practical performance on long sequences. The attention mechanism, which we will cover next, addressed this limitation and eventually led to the Transformer architecture that dominates today.

Summary

Key Equations
  • Encoder: h_t = f(x_t, h_{t-1}), with context vector c = h_T
  • Decoder: s_t = g(y_{t-1}, s_{t-1}, c), y_t ~ softmax(W_o s_t)
  • Image features: h_0 = tanh(W_I · CNN(image))
Key Applications
  • Image Captioning: CNN + RNN generates descriptions
  • Translation: Encoder-decoder across languages
  • Speech: Bidirectional LSTM with CTC
  • VQA: Multimodal fusion for QA
Key Concepts
  • Encoder-decoder: Compress input, then generate output
  • Context vector: Fixed-size representation of entire input (the bottleneck)
  • Autoregressive generation: Output one token at a time, feeding back as input
  • Multimodal fusion: Combining features from different modalities

In the next module, we will see how attention mechanisms solved the bottleneck problem and revolutionized sequence modeling - eventually leading to the Transformer architecture that powers modern AI.

Test Your Knowledge

Check your understanding of encoder-decoder architectures and multimodal applications

8 questions
70% to pass