
Chapter 4: Transformer


What this chapter is ultimately trying to achieve

To introduce and explain the core mechanisms of the Transformer architecture, specifically focusing on the decoder-only variant, which is prevalent in autoregressive language models like GPT. The key is to understand how it processes input sequences, attends to relevant information, and incorporates positional context, all without the sequential recurrence of RNNs.

Let’s break down the key concepts:

Core Idea: Parallel Processing with Attention

Unlike RNNs that process tokens one by one, Transformers can process all tokens in an input sequence simultaneously (or at least in parallel up to the context window length). The “magic” that allows them to understand relationships between tokens in this parallel setup is the self-attention mechanism.

4.1 Decoder Block (The Building Block)

  • What it’s ultimately trying to achieve: A Transformer model is typically a stack of identical “decoder blocks” (or encoder blocks, or both, depending on the variant). Each block takes a sequence of token representations as input and outputs a refined sequence of representations of the same length.
  • Decoder-Only Focus: For autoregressive language modeling (predicting the next token), we typically use a stack of decoder blocks. The input is the sequence of tokens seen so far, and the output of the final block is used to predict the next token.
  • Structure of a Decoder Block: Each block has two main sub-layers:
    1. Masked Multi-Head Self-Attention: Allows each token to “look at” other tokens in the sequence (including itself) to gather contextual information. The “masked” part is crucial for autoregressive models to prevent a token from seeing future tokens during training.
    2. Position-Wise Feedforward Network (MLP): A standard fully connected feedforward network applied independently to each token’s representation after the attention step.
  • Additional Components within a block (crucial for making it work):
    • Residual Connections (Skip Connections): The input to each sub-layer is added to its output. This helps with gradient flow and enables training much deeper networks.
    • Layer Normalization (RMSNorm in the book): Applied before each sub-layer to stabilize the activations and improve training.
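The wiring described above can be sketched in a few lines (a minimal sketch with identity stand-ins for the sub-layers, which are defined in the following sections; names like `decoder_block` are illustrative, not the book's code):

```python
import numpy as np

def decoder_block(x, attention, mlp, norm1, norm2):
    # Pre-norm residual structure: normalize, apply the sub-layer,
    # then add the sub-layer's input back in (the skip connection).
    x = x + attention(norm1(x))  # masked multi-head self-attention
    x = x + mlp(norm2(x))        # position-wise feedforward network
    return x

# Identity stand-ins just to show the data flow: same shape in, same shape out.
identity = lambda v: v
x = np.ones((4, 8))  # (sequence length, embedding dimension)
y = decoder_block(x, identity, identity, identity, identity)
assert y.shape == x.shape
```

In a real block the attention and MLP replace the identity stand-ins, but the residual wiring and the input/output shapes stay exactly as shown.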

4.2 Self-Attention (The Heart of the Transformer)

  • What it’s ultimately trying to achieve: For each token in the input sequence, self-attention calculates a new representation by taking a weighted sum of the representations of all tokens in the sequence (respecting the causal mask for decoders). The weights determine how much “attention” each token should pay to every other token (including itself) when computing its updated representation.

  • The QKV (Query, Key, Value) Analogy: Imagine you’re looking up information in a library (this is a common analogy):

    • Query (Q): For the current token you’re trying to update, you formulate a “query” representing what kind of information it’s looking for.
    • Key (K): Every token in the sequence (including the current one) has a “key” that acts like an index tag, describing what kind of information it holds.
    • Value (V): Every token also has a “value,” which is the actual content or representation of that token.

    The process:
    1. Project Inputs: The input embedding for each token x_t is projected into three different vectors: q_t (query), k_t (key), and v_t (value) using learnable weight matrices (WQ, WK, WV).
    2. Calculate Attention Scores: For a given query q_t, its “compatibility” or “similarity” with every key k_j in the sequence is calculated, usually via a dot product: score(q_t, k_j) = q_t ⋅ k_j.
    3. Scale Scores: The scores are divided by the square root of the key vector’s dimension, sqrt(d_k). Without scaling, dot products grow in magnitude with d_k, pushing the softmax into a saturated regime where gradients become vanishingly small.
    4. Apply Causal Mask (for Decoders): To ensure autoregression (a token can’t see future tokens), scores corresponding to attention to future positions are set to negative infinity before the softmax. This makes their softmax probability zero.
    5. Convert Scores to Weights (Softmax): The masked, scaled scores are passed through a softmax function. This converts the scores into positive attention weights that sum to 1 across all tokens in the sequence. These weights indicate how much q_t should attend to each v_j.
    6. Compute Output: The output representation for token t (g_t in the book) is a weighted sum of all value vectors v_j, where the weights are the attention weights computed in the previous step: g_t = sum_j (attention_weight_tj * v_j).
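The six steps above can be condensed into a NumPy sketch for a single head (toy dimensions and random weights; `causal_self_attention` is an illustrative name, not the book's implementation):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention over a (T, d) input sequence."""
    T, d_k = x.shape[0], Wk.shape[1]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                # 1. project inputs
    scores = Q @ K.T / np.sqrt(d_k)                 # 2-3. dot products, scaled
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                          # 4. hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # 5. softmax over each row
    return weights @ V                              # 6. weighted sum of values

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
g = causal_self_attention(x, Wq, Wk, Wv)
assert g.shape == (T, d)
```

A quick sanity check on the mask: position 0 can attend only to itself, so its output is exactly its own value vector, i.e. `g[0]` equals `x[0] @ Wv`.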
  • Multi-Head Attention:

    • Intuition: Instead of having one set of Q, K, V matrices, self-attention is often performed multiple times in parallel, each with different learned Q, K, V projection matrices. These are called “heads.”
    • Benefit: Each head can learn to focus on different types of relationships or different aspects of the input sequence (e.g., one head might focus on syntactic relationships, another on semantic ones over a longer distance).
    • Process: The input is split (or projected) into multiple smaller-dimensional Q, K, V sets. Attention is computed independently for each head. The outputs of all heads are then concatenated and linearly projected back to the original embedding dimension.
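The split-compute-concatenate pattern can be sketched as follows (the per-head functions here are plain linear maps standing in for full attention heads; all names and shapes are illustrative):

```python
import numpy as np

def multi_head(x, head_fns, Wo):
    # Run every head on the same input, concatenate along the feature axis,
    # then mix the heads together with a final linear projection Wo.
    head_outputs = [fn(x) for fn in head_fns]
    return np.concatenate(head_outputs, axis=-1) @ Wo

T, d, n_heads = 4, 8, 2
d_head = d // n_heads                 # each head works in a smaller subspace
rng = np.random.default_rng(1)
x = rng.normal(size=(T, d))
# Toy "heads": fixed (d -> d_head) linear maps in place of attention heads.
heads = [lambda v, W=rng.normal(size=(d, d_head)): v @ W for _ in range(n_heads)]
Wo = rng.normal(size=(d, d))          # final output projection
out = multi_head(x, heads, Wo)
assert out.shape == (T, d)            # back to the original embedding dimension
```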

4.3 Position-Wise Multilayer Perceptron (MLP)

  • What it’s ultimately trying to achieve: After the self-attention mechanism has aggregated contextual information into each token’s representation, the MLP provides further non-linear processing.
  • How it works: It’s a standard two-layer feedforward network (e.g., Linear -> ReLU -> Linear). Crucially, the same MLP (with the same weights) is applied independently to the representation of each token in the sequence.
    • The book describes it as z_t = W2(ReLU(W1 * g_t + b1)) + b2.
    • Often, the intermediate layer in the MLP is larger than the input/output dimension (e.g., 4x the embedding dimension), allowing the model to learn more complex transformations before projecting back down.
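The formula above can be batched over the whole sequence (toy shapes with a 4x hidden layer; the function name is illustrative):

```python
import numpy as np

def position_wise_mlp(g, W1, b1, W2, b2):
    # z_t = W2 @ relu(W1 @ g_t + b1) + b2, applied to every token at once;
    # the same weights are shared across all positions.
    h = np.maximum(g @ W1 + b1, 0.0)  # expand to the wider hidden layer + ReLU
    return h @ W2 + b2                # project back to the embedding dimension

d, d_hidden = 8, 32                   # hidden layer 4x the embedding dimension
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(d, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d)), np.zeros(d)
g = rng.normal(size=(5, d))           # (sequence length, embedding dimension)
z = position_wise_mlp(g, W1, b1, W2, b2)
assert z.shape == g.shape
```

“Position-wise” means token t’s output depends only on g_t: feeding a single row through the MLP gives the same result as taking that row from the batched output.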

4.4 Rotary Position Embedding (RoPE)

  • What it’s ultimately trying to achieve: To inject information about the absolute and relative positions of tokens into the model, since the self-attention mechanism itself has no notion of order: it is permutation-equivariant, so shuffling the input tokens merely shuffles the outputs correspondingly. RNNs get position information for free from sequential processing; Transformers need an explicit mechanism.

  • The Core Idea (Relative Position from Absolute Rotation): RoPE is an elegant solution that encodes the absolute position of a token in a way that allows the attention mechanism to easily deduce the relative positions between tokens. Imagine you have a spinning compass needle for each word’s query (Q) and key (K) vectors.

    1. Absolute Position as Spin: For a word at position m, its Q and K compass needles are spun by an amount proportional to m. A word at position 1 spins a little, a word at position 2 spins twice as much, and so on. The final orientation of a needle tells you the word’s absolute position.

    2. Attention as Needle Alignment: To calculate attention between two words, the model compares the orientation of the first word’s Q-needle to the second word’s K-needle. The “comparison” measures how well the needles are aligned.

    3. The Magic - Relative Distance Emerges: Due to the mathematical properties of rotation (using sine and cosine), the alignment between two needles—one spun by m units and the other by n units—depends only on the difference in their spin, m - n. The absolute positions don’t matter for the comparison, only the relative distance.

    This is powerful because the model learns relational patterns, like “the word 2 positions after a verb is often an object,” regardless of where that verb appears in the sentence.
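The claim in step 3 can be verified with a single 2-D rotation, the basic operation RoPE applies to each pair of dimensions (a toy demo with one rotation frequency and made-up vectors, not a full RoPE implementation):

```python
import math

def rotate(v, angle):
    # Rotate a 2-D vector; RoPE applies such rotations to paired Q/K dims.
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k, theta = (0.3, -1.2), (0.7, 0.5), 0.1

# Same relative offset (m - n = 3) at two different absolute positions:
s1 = dot(rotate(q, 5 * theta), rotate(k, 2 * theta))   # positions 5 and 2
s2 = dot(rotate(q, 10 * theta), rotate(k, 7 * theta))  # positions 10 and 7
assert math.isclose(s1, s2)  # the score depends only on m - n, not on m or n
```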

  • Key Advantages of RoPE:

    1. Relative from Absolute: It encodes absolute positions, but the self-attention score naturally becomes sensitive to relative distance, which is more intuitive for language.
    2. Multi-Scale Information: It uses different rotation speeds for different parts of the vector, allowing it to capture both local (fast-rotating) and global (slow-rotating) positional context.
    3. Excellent Extrapolation: It generalizes well to sequence lengths longer than those seen during training—a significant advantage for long documents.
    4. No Extra Parameters: RoPE modifies existing Q and K vectors with fixed rotations, avoiding extra learnable parameters for positional information.

4.5 Residual Connections & Layer Normalization (Revisited)

  • Residual Connections: As in Chapter 1’s general NN discussion and as seen in the block diagram, the input to a sub-layer (e.g., self-attention or MLP) is added to its output: x_output = SubLayer(x_input) + x_input. This is critical for training deep stacks of decoder blocks by preventing vanishing gradients.
  • Layer Normalization (RMSNorm): Applied before each sub-layer (self-attention and MLP), the “pre-norm” arrangement; since each sub-layer’s input is the previous residual sum, the normalization effectively acts on the post-residual stream.
    • RMSNorm(x) = (x / sqrt(mean(x^2) + epsilon)) * gamma (where gamma is a learnable scaling parameter and epsilon is for numerical stability).
    • It normalizes the features for each token independently across its embedding dimension. This helps stabilize training, making it less sensitive to the scale of parameters and activations, and allows for faster convergence.
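The formula above translates directly into NumPy (an illustrative sketch; the learnable gain gamma is set to ones here):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    # RMSNorm(x) = x / sqrt(mean(x^2) + eps) * gamma, computed per token
    # across its embedding dimension.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

d = 8
x = np.random.default_rng(3).normal(size=(4, d)) * 100.0  # large activations
y = rms_norm(x, gamma=np.ones(d))
# Each token now has root-mean-square ~1, whatever the input scale was.
assert np.allclose(np.sqrt(np.mean(y ** 2, axis=-1)), 1.0)
```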

4.6 Key-Value Caching (For Inference)

  • What it’s ultimately trying to achieve: To speed up text generation (inference) which is autoregressive (one token at a time).
  • The Problem: During training, we can compute attention over the whole sequence in parallel. But during inference, when generating token t+1, we’ve already computed the Key (K) and Value (V) matrices for tokens 1...t. Without caching, we’d recompute these K and V vectors for all previous tokens every time we generate a new token.
  • The Solution: Cache (store) the K and V vectors for all previously generated tokens in each layer. When generating the next token:
    1. Only compute the Q, K, V vectors for the newly generated token.
    2. Append the new K and V vectors to their respective cached K and V matrices.
    3. The new Q vector attends to all the K vectors in the updated cache. This significantly reduces computation because the K and V projections for past tokens don’t change.
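The three steps can be sketched as a generation loop (a toy single-head version with hypothetical names; a real implementation keeps one cache per layer and per head):

```python
import numpy as np

def generate_step(x_new, Wq, Wk, Wv, cache):
    """One autoregressive step: project only the new token, reuse the cache."""
    q = x_new @ Wq                 # 1. Q, K, V for the new token only
    cache["K"].append(x_new @ Wk)  # 2. append the new K and V to the cache
    cache["V"].append(x_new @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # 3. the new Q attends to all cached K
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                   # attention output for the new token

rng = np.random.default_rng(4)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):
    # No causal mask is needed here: the cache only ever holds past tokens.
    out = generate_step(rng.normal(size=d), Wq, Wk, Wv, cache)
assert len(cache["K"]) == 5 and out.shape == (d,)
```

Note that the projections for past tokens are never recomputed: each step does one set of Q/K/V projections instead of t of them.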

4.9 Transformer in Python (Implementation)

  • Putting it all together in PyTorch:
    • AttentionHead class: Implements a single attention head, including QKV projections, RoPE, scaled dot-product attention, and masking.
    • MultiHeadAttention class: Contains multiple AttentionHead instances, concatenates their outputs, and applies a final linear projection.
    • MLP class: The position-wise feedforward network.
    • RMSNorm class.
    • DecoderBlock class: Combines MultiHeadAttention, MLP, RMSNorm, and residual connections.
    • DecoderLanguageModel class: Stacks multiple DecoderBlocks, includes an embedding layer at the input, and a final linear layer to project outputs to vocabulary logits. The forward method also creates the causal mask.

The training loop for this Transformer decoder is conceptually very similar to the RNN LM training loop: prepare input and (shifted) target sequences, pass through the model, compute cross-entropy loss, backpropagate, and update.