Chapter 3: Recurrent Neural Network
What this chapter is ultimately trying to achieve
To introduce a type of neural network specifically designed to process sequences of data (like words in a sentence) one element at a time, while maintaining an internal “memory” or “state” that captures information from previous elements in the sequence. This “memory” allows RNNs to understand context that spans multiple tokens, which is something BoW or simple n-gram models struggle with significantly.
Let’s break down the key concepts:
3.1 Elman RNN (Simple Recurrent Neural Network)
What it’s ultimately trying to achieve: To process a sequence of inputs (e.g., word embeddings) step-by-step, and at each step, produce an output and update an internal hidden state. This hidden state acts as a compressed summary of the sequence seen so far.
The Core Idea (The Loop): Imagine a standard neural network unit. Now, give it a loop: the output of the unit at a given time step t (specifically, its hidden state h_t) is fed back into the unit as an additional input at the next time step t+1, along with the actual next input from the sequence x_{t+1}.
- Input: At each time step t, the RNN unit takes two things:
  - The current input from the sequence, x_t (e.g., the embedding of the current word).
  - The hidden state from the previous time step, h_{t-1}.
- Calculation: Inside the unit, these inputs are typically transformed by weight matrices and an activation function (often tanh in classic RNNs) to produce:
  - The new hidden state for the current time step, h_t.
  - (Optionally) An output for the current time step, y_t. For language modeling, this y_t would be related to predicting the next word.
- Formula (Conceptual):
  - h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h) (hidden state update)
  - y_t = W_hy * h_t + b_y (output at time t, often passed through softmax for probabilities)

  Where W_hh, W_xh, W_hy are weight matrices and b_h, b_y are bias terms. These weights are shared across all time steps, which is key to how RNNs generalize.
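The hidden-state update can be sketched directly in PyTorch. The dimensions here (hidden size 3, input size 5) are illustrative, not from the book:

```python
import torch

# One conceptual RNN step, h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b_h),
# with illustrative dimensions (hidden size 3, input size 5).
torch.manual_seed(0)
hidden_size, input_size = 3, 5
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
b_h = torch.zeros(hidden_size)                # hidden bias

h_prev = torch.zeros(hidden_size)   # initial hidden state
x_t = torch.randn(input_size)       # current input (e.g., a word embedding)
h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
print(h_t.shape)  # torch.Size([3])
```

Because of the tanh, every component of h_t stays in (-1, 1), which keeps the hidden state bounded from step to step.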
Visualizing It: You can “unroll” an RNN in time. It looks like a chain of identical network units, where the hidden state from one unit is passed to the next.
Realism and Challenges:
- Vanishing/Exploding Gradients: When training RNNs with backpropagation through time (BPTT), gradients can become very small (vanish) or very large (explode) as they are propagated back through many time steps. This makes it hard for simple RNNs to learn long-range dependencies (e.g., connecting a word at the beginning of a long sentence to a word at the end). ReLU helps with vanishing gradients compared to tanh/sigmoid in deep feedforward nets, but the recurrent nature still poses challenges. LSTMs and GRUs (which are more advanced RNN variants, not deeply covered in a 100-page book but important to know about) were developed to mitigate this.
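The vanishing/exploding behavior can be demonstrated with a toy unrolled loop. The weight matrices below are deliberately simple scaled identities (an illustrative choice, not a realistic initialization), so the effect of the spectral norm is easy to see:

```python
import torch

# Minimal sketch: backpropagating through many time steps multiplies the
# gradient by the recurrent weights repeatedly, so it vanishes when their
# spectral norm is < 1 and explodes when it is > 1.
def grad_after_steps(scale, steps=50, hidden=8):
    W_hh = scale * torch.eye(hidden)       # recurrent weights, spectral norm = scale
    h0 = torch.zeros(hidden, requires_grad=True)
    h = h0
    for _ in range(steps):                 # unrolled forward pass (as in BPTT)
        h = torch.tanh(h @ W_hh)
    h.sum().backward()                     # gradient w.r.t. the initial state
    return h0.grad.norm().item()

print(grad_after_steps(0.5))  # vanishing: on the order of 0.5**50, effectively zero
print(grad_after_steps(1.5))  # exploding: grows with every unrolled step
```

LSTMs and GRUs address exactly this: their gating mechanisms give gradients a more direct path through time.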
3.2 Mini-Batch Gradient Descent (Revisited for Sequences)
What it’s ultimately trying to achieve: To efficiently train RNNs (and other large models) by processing multiple sequences in parallel within each training step, rather than one sequence at a time or the entire dataset at once.
The Setup for Sequences: When we feed data to an RNN, it’s often in the shape of (batch_size, sequence_length, embedding_dimensionality).
- batch_size: The number of sequences processed together.
- sequence_length: The number of tokens in each sequence (sequences are often padded to be the same length in a batch).
- embedding_dimensionality: The size of the vector representing each token.
Why it’s important: Processing batches leverages the parallel processing capabilities of modern hardware (like GPUs), making training much faster. It also provides a more stable estimate of the gradient compared to processing single examples (stochastic gradient descent).
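A quick way to see these shapes is to pass a batched tensor through PyTorch’s built-in nn.RNN (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

# A batch shaped (batch_size, sequence_length, embedding_dimensionality),
# fed to PyTorch's built-in RNN; batch_first=True matches this layout.
batch_size, seq_len, emb_dim, hidden_dim = 4, 10, 32, 64
batch = torch.randn(batch_size, seq_len, emb_dim)

rnn = nn.RNN(input_size=emb_dim, hidden_size=hidden_dim, batch_first=True)
output, h_n = rnn(batch)
print(output.shape)  # hidden state at every time step: torch.Size([4, 10, 64])
print(h_n.shape)     # final hidden state: torch.Size([1, 4, 64])
```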
3.3 Programming an RNN (in PyTorch)
What it’s ultimately trying to achieve: To translate the mathematical concept of an RNN unit and a stack of RNN layers into working code.
Key PyTorch Components:
- nn.Module: The base class for all neural network modules in PyTorch. Our RNN unit and the full RNN model will inherit from this.
- nn.Parameter: Wraps a tensor to tell PyTorch that it’s a learnable model parameter (like the weight matrices W_hh, W_xh).
- The __init__ method: Where you define the layers and parameters of your model.
- The forward method: Where you define how the input data flows through the layers to produce an output. For an RNN, this will involve a loop over the time steps of the input sequence.
Implementing the ElmanRNNUnit:
- Initialize weight matrices (Uh for hidden-to-hidden, Wh for input-to-hidden) and a bias term (b).
- The forward method takes current input x and previous hidden state h_prev and computes h_new = tanh(x @ Wh + h_prev @ Uh + b).
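A minimal sketch of such a unit, using the Uh/Wh/b names from the description above (the initialization scale is an illustrative choice, not the book’s exact code):

```python
import torch
import torch.nn as nn

# One Elman RNN unit: h_new = tanh(x @ Wh + h_prev @ Uh + b).
class ElmanRNNUnit(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.Uh = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)  # hidden-to-hidden
        self.Wh = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)  # input-to-hidden
        self.b = nn.Parameter(torch.zeros(emb_dim))                   # bias

    def forward(self, x, h_prev):
        # x: (batch, emb_dim), h_prev: (batch, emb_dim)
        return torch.tanh(x @ self.Wh + h_prev @ self.Uh + self.b)

unit = ElmanRNNUnit(emb_dim=8)
h = unit(torch.randn(4, 8), torch.zeros(4, 8))
print(h.shape)  # torch.Size([4, 8])
```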
Implementing the full ElmanRNN (stacking layers):
- The ElmanRNN class would contain a list of ElmanRNNUnit instances (one for each layer).
- Its forward method would:
  - Initialize hidden states for all layers (usually to zeros).
  - Loop through each token (time step t) in the input sequences of the batch.
  - For each token, loop through each RNN layer:
    - The input to the first layer is the token’s embedding.
    - The input to subsequent layers is the hidden state from the layer below at the same time step.
    - Each layer updates its hidden state.
  - Collect the outputs (usually the hidden states of the last layer at each time step).
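The nested loops above can be sketched as follows. This is a self-contained illustration (unit and sizes are hypothetical; a real implementation would likely also return the final hidden states):

```python
import torch
import torch.nn as nn

# The per-step unit: h_new = tanh(x @ Wh + h_prev @ Uh + b).
class ElmanRNNUnit(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.Uh = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)
        self.Wh = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x, h_prev):
        return torch.tanh(x @ self.Wh + h_prev @ self.Uh + self.b)

# A stack of units: outer loop over time steps, inner loop over layers.
class ElmanRNN(nn.Module):
    def __init__(self, emb_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(ElmanRNNUnit(emb_dim) for _ in range(num_layers))

    def forward(self, x):
        # x: (batch_size, seq_len, emb_dim)
        batch_size, seq_len, emb_dim = x.shape
        # One hidden state per layer, initialized to zeros
        h = [torch.zeros(batch_size, emb_dim) for _ in self.layers]
        outputs = []
        for t in range(seq_len):              # loop over time steps
            inp = x[:, t, :]                  # first layer sees the embedding
            for i, layer in enumerate(self.layers):
                h[i] = layer(inp, h[i])       # update this layer's hidden state
                inp = h[i]                    # next layer sees it as input
            outputs.append(h[-1])             # collect the last layer's state
        return torch.stack(outputs, dim=1)    # (batch_size, seq_len, emb_dim)

rnn = ElmanRNN(emb_dim=8, num_layers=2)
out = rnn(torch.randn(4, 10, 8))
print(out.shape)  # torch.Size([4, 10, 8])
```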
3.4 RNN as a Language Model
What it’s ultimately trying to achieve: To use the RNN architecture to perform the core language modeling task: predicting the next token in a sequence.
The Architecture:
- Embedding Layer: Converts input token IDs into dense embedding vectors. This is often nn.Embedding in PyTorch.
- RNN Layers: One or more RNN layers (like our ElmanRNN) process the sequence of embeddings and output a sequence of hidden states (usually from the final RNN layer).
- Output (Linear) Layer / Classification Head: A fully connected linear layer takes the RNN’s hidden state output at each time step t and transforms it into a vector of logits, where the size of this vector is the vocabulary size.
- Softmax (implicitly with CrossEntropyLoss): These logits are then (conceptually, often combined within the loss function) passed through a softmax function to get probabilities for each word in the vocabulary being the next word.
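Wired together, the three pieces look like this. The sketch below uses PyTorch’s built-in nn.RNN in place of a hand-written ElmanRNN, and all sizes are hypothetical:

```python
import torch
import torch.nn as nn

# Embedding -> RNN -> linear head producing per-position logits over the vocabulary.
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # logits over the vocabulary

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        emb = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        hidden_states, _ = self.rnn(emb)      # (batch, seq_len, hidden_dim)
        return self.head(hidden_states)       # (batch, seq_len, vocab_size)

model = RNNLanguageModel(vocab_size=100, emb_dim=16, hidden_dim=32)
logits = model(torch.randint(0, 100, (4, 10)))
print(logits.shape)  # torch.Size([4, 10, 100])
```

Note that no softmax appears in forward: nn.CrossEntropyLoss applies it internally during training.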
Training:
- Input Sequence: A sequence of token IDs, e.g., [token_A, token_B, token_C].
- Target Sequence: The input sequence shifted by one position, e.g., [token_B, token_C, token_D].
- At each time step t, the model processes input_token_t and its goal is to output a high probability for target_token_t.
- The cross-entropy loss is calculated between the predicted probability distribution and the actual target token at each position, and then averaged.
3.5 Embedding Layer (Deeper Dive with nn.Embedding)
What it’s ultimately trying to achieve: To provide a learnable lookup table that maps discrete token indices (integers) to dense, continuous-valued embedding vectors.
How it works in PyTorch (nn.Embedding):
- When you create nn.Embedding(vocab_size, emb_dim), PyTorch initializes a weight matrix of shape (vocab_size, emb_dim) with random values. Each row i of this matrix is the embedding vector for token ID i.
- When you pass a tensor of token IDs to this layer, it simply looks up and returns the corresponding rows (embedding vectors).
- These embedding vectors are learnable parameters. During training, gradients flow back to them, and they get updated to better represent the tokens for the given task.
- padding_idx: You can specify an index to be treated as a padding token. The embedding for this token will be a zero vector and (importantly) will not be updated during training.
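A short demonstration of the lookup and of padding_idx (the vocabulary size and dimensions are illustrative; index 0 is chosen as the padding token here):

```python
import torch
import torch.nn as nn

# A learnable lookup table: 10 tokens, 4-dimensional vectors, index 0 = padding.
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
ids = torch.tensor([[1, 2, 0]])        # last position is a padding token
vectors = emb(ids)                     # looks up rows 1, 2, and 0
print(vectors.shape)                   # torch.Size([1, 3, 4])
print(emb.weight[0])                   # the padding row is all zeros
```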
3.6 Training an RNN Language Model (The Full Loop in PyTorch)
What it’s ultimately trying to achieve: To put all the pieces together – data preparation, model instantiation, loss function, optimizer, and the training loop – to actually train an RNN LM.
Key Steps in the Training Loop (per epoch, per batch):
- model.train(): Set the model to training mode.
- Get input_seq and target_seq from the DataLoader.
- Move data to the correct device (CPU/GPU).
- optimizer.zero_grad(): Clear old gradients.
- outputs = model(input_seq): Forward pass to get logits.
- Reshape outputs and target_seq so that the loss can be computed across all tokens in the batch efficiently. Typically, this means flattening them so that each row corresponds to a single token prediction:
  - outputs becomes (batch_size * seq_len, vocab_size)
  - target_seq becomes (batch_size * seq_len)
- loss = criterion(outputs, target_seq): Calculate the cross-entropy loss. Remember nn.CrossEntropyLoss in PyTorch expects raw logits and handles the softmax internally. It also allows an ignore_index parameter, which is crucial for not calculating loss on padding tokens in the target_seq.
- loss.backward(): Backward pass to compute gradients.
- optimizer.step(): Update model parameters.
Reproducibility: Setting seeds (random.seed(), torch.manual_seed(), torch.cuda.manual_seed_all()) is important for consistent results, especially when debugging or comparing experiments.
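The steps above can be condensed into one toy training step. Random tensors stand in for a DataLoader batch, the model is built from PyTorch primitives, and index 0 is assumed to be the padding token (all sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                                  # reproducibility
vocab_size, emb_dim, hidden_dim = 20, 8, 16

# Toy model: embedding -> RNN -> linear head
embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, vocab_size)
params = list(embedding.parameters()) + list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=0)       # no loss on padding tokens

# Random stand-ins for one batch from a DataLoader: (batch_size, seq_len)
input_seq = torch.randint(1, vocab_size, (4, 10))
target_seq = torch.randint(1, vocab_size, (4, 10))

optimizer.zero_grad()                                 # clear old gradients
hidden_states, _ = rnn(embedding(input_seq))          # forward pass
outputs = head(hidden_states)                         # (batch, seq_len, vocab)
loss = criterion(outputs.reshape(-1, vocab_size),     # flatten for the loss
                 target_seq.reshape(-1))
loss.backward()                                       # compute gradients
optimizer.step()                                      # update parameters
print(loss.item())
```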
3.7 Dataset and DataLoader (PyTorch Utilities)
What they are ultimately trying to achieve: To provide a standardized and efficient way to load, preprocess, and iterate over data in batches during training.
- Dataset: An abstract class representing your dataset. You need to implement:
  - __init__(self, ...): Load/prepare your data (e.g., read from file, tokenize).
  - __len__(self): Return the total number of samples in the dataset.
  - __getitem__(self, idx): Return the idx-th sample (e.g., an input sequence and its corresponding target sequence, as tensors).
- DataLoader: Wraps a Dataset and provides an iterator to loop over the data in batches. It handles:
  - Batching.
  - Shuffling (optional, but good for training).
  - Parallel data loading using multiple worker processes (optional, for speed).
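A minimal Dataset for language modeling might slice a long stream of token IDs into (input, shifted target) pairs. The class name and toy data below are hypothetical:

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Hypothetical dataset: slices a long token stream into fixed-length windows,
# where the target is the input shifted by one position.
class NextTokenDataset(Dataset):
    def __init__(self, token_ids, seq_len):
        self.token_ids = token_ids
        self.seq_len = seq_len

    def __len__(self):
        return len(self.token_ids) - self.seq_len

    def __getitem__(self, idx):
        chunk = self.token_ids[idx : idx + self.seq_len + 1]
        x = torch.tensor(chunk[:-1])   # input sequence
        y = torch.tensor(chunk[1:])    # target: shifted by one
        return x, y

ds = NextTokenDataset(list(range(100)), seq_len=10)   # toy token stream
loader = DataLoader(ds, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([4, 10]) torch.Size([4, 10])
```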
3.8 Training Data and Loss Computation (for Language Modeling)
What it’s ultimately trying to achieve: To clarify exactly how input and target sequences are structured for training an autoregressive language model, and how the loss is computed across all positions.
The “Shifted” Target: For an input sequence like [T1, T2, T3, T4], the target sequence is [T2, T3, T4, T5].
- When the model sees T1, it tries to predict T2.
- When it sees T1, T2, it tries to predict T3.
- And so on. The hidden state h_t carries context from T1...T_t to help predict T_{t+1}.
Loss Calculation: The cross-entropy loss is calculated at each position where a prediction is made. The total loss for a sequence is typically the average of these per-position losses. When batching, it’s the average loss over all predictable tokens in the batch.
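Putting the shifted targets and the per-position averaging together (the logits here are random stand-ins for a model's output, and the token IDs are illustrative):

```python
import torch
import torch.nn as nn

# Input [T1, T2, T3, T4] paired with the left-shifted target [T2, T3, T4, T5].
vocab_size, seq_len = 10, 4
input_ids = torch.tensor([[1, 2, 3, 4]])
target_ids = torch.tensor([[2, 3, 4, 5]])    # shifted by one position
logits = torch.randn(1, seq_len, vocab_size) # stand-in for model(input_ids)

criterion = nn.CrossEntropyLoss()
# Flatten so each row is one token prediction; the loss averages over positions
loss = criterion(logits.view(-1, vocab_size), target_ids.view(-1))
print(loss.item())
```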