
Chapter 5: Large Language Model


What this chapter is ultimately trying to achieve

To explain why “large” matters in language models, what “large” actually entails, and how these scaled-up pretrained models are then adapted (finetuned) to become useful for a wide range of tasks beyond just predicting the next token. We also delve into practical aspects like interacting with them (sampling, prompt engineering) and addressing their inherent limitations (hallucinations, ethics).

Let’s break down the key sections:

5.1 Why Larger Is Better

  • What it’s ultimately trying to achieve: To establish that the remarkable abilities of LLMs (like understanding complex instructions, generating coherent long-form text, some forms of reasoning, few-shot learning) are not just incremental improvements but often emergent properties that arise when model size, data size, and compute cross certain thresholds.

  • The Core Idea (Scale is Key): Pretraining a Transformer (like the decoder we built in Chapter 4) on a massive dataset of text (trillions of tokens) with a huge number of parameters (billions to trillions) and a large context window (thousands to hundreds of thousands of tokens) allows it to learn intricate patterns, world knowledge, and even some rudimentary reasoning skills simply from the task of predicting the next token.

    • The example of CRISPR-Cas9 in the book illustrates this: to predict the next token accurately in a scientific text, the model must implicitly learn a lot about the underlying concepts.
  • The “Large” Factors:

    1. Large Parameter Count:
      • Our decoder model had ~8 million parameters. Modern LLMs (Llama 3.1 70B, Gemma 2 27B) have tens to hundreds of billions.
      • More parameters mean more capacity to store information, learn complex patterns, and represent nuances of language and knowledge.
    2. Large Context Size:
      • Our decoder used 30 tokens. LLMs can handle thousands (e.g., GPT-3’s 2K-4K) to over a hundred thousand (e.g., Llama 3.1’s 128K, some models even 1M+).
      • A larger context allows the model to understand and generate text that maintains coherence over much longer spans, remember earlier parts of a conversation or document, and tackle tasks requiring access to more information.
      • Achieving this involves architectural improvements like grouped-query attention and FlashAttention, and specialized training stages like long-context pretraining.
    3. Large Training Dataset:
      • Our RNN was trained on ~25 million tokens. LLMs are trained on trillions of tokens from diverse sources (books, web pages, code, academic papers, social media).
      • This diversity exposes the model to a vast range of language styles, topics, and knowledge.
      • Typically, due to the sheer size, models are trained for a single epoch.
    4. Large Amount of Compute:
      • Training these models requires enormous computational resources (thousands of GPUs running for months, costing millions of dollars).
      • This involves sophisticated parallelization techniques (tensor, pipeline, context, and data parallelism – 4D parallelism).
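The "large parameter count" factor can be made concrete with a back-of-the-envelope count. The sketch below assumes a standard GPT-style decoder with tied input/output embeddings and ignores biases and LayerNorm terms (a negligible fraction at scale); `estimate_params` is a hypothetical helper, not something from the book.

```python
def estimate_params(vocab_size, d_model, n_layers, d_ff=None):
    """Rough parameter count for a GPT-style decoder.

    Assumes tied embeddings, standard multi-head attention
    (four d_model x d_model projections), and a 2-layer MLP
    with hidden size d_ff (default 4 * d_model). Biases and
    LayerNorm parameters are ignored.
    """
    if d_ff is None:
        d_ff = 4 * d_model
    embed = vocab_size * d_model      # token embeddings (shared with output head)
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff          # up- and down-projection
    return embed + n_layers * (attn + mlp)

# At GPT-2-small scale (50257-token vocab, 768 dims, 12 layers)
print(f"{estimate_params(50257, 768, 12):,}")  # → 123,532,032 (close to GPT-2's 124M)
```

Plugging in a 70B-class configuration (large vocabulary, d_model in the thousands, 80 layers) shows how quickly the per-layer terms dominate the embedding term as models scale.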

5.2 Supervised Finetuning (SFT)

  • What it’s ultimately trying to achieve: To transform a base pretrained LLM (which is good at next-token prediction but not necessarily at following instructions) into a helpful and instruction-following assistant or a model specialized for specific tasks.

  • The Core Idea (Teaching to Behave): While pretraining gives the model its raw knowledge and language understanding, SFT teaches it the format of interaction.

    • The model is further trained on a smaller, high-quality dataset of instruction-response pairs (or dialogue turns).
    • Examples:
      • Instruction: “Translate ‘Good night’ into Spanish.” Response: “Buenas noches.”
      • Instruction: “Write a poem about a cat.” Response: “[A poem about a cat]”
    • The model is still trained to predict the next token, but the “context” is now the instruction, and the “target” is the desired response.
    • This process “unlocks” the pretrained knowledge and makes it accessible in a conversational or task-oriented way. The book shows the difference between a base gemma-2-2b (just completes text) and gemma-2-2b-it (instruction-tuned, follows the list continuation prompt).
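Conceptually, building one SFT training example is just string assembly. The sketch below uses made-up `<user>`/`<assistant>` markers; real models each define their own chat template (for example, Gemma uses `<start_of_turn>` tokens and ChatML uses `<|im_start|>`), and the format used in training must match the one used at inference.

```python
def format_example(instruction: str, response: str) -> str:
    # Hypothetical turn markers for illustration only; use the
    # target model's own chat template in practice.
    return (
        f"<user>\n{instruction}\n</user>\n"
        f"<assistant>\n{response}\n</assistant>"
    )

pair = format_example("Translate 'Good night' into Spanish.", "Buenas noches.")
print(pair)
```

The model is then trained on such strings with ordinary next-token prediction, with the loss usually restricted to the response portion.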

5.3 Finetuning a Pretrained Model (Practical Example)

  • What it’s ultimately trying to achieve: To walk through the practical steps of finetuning an existing open-weight LLM (like GPT-2 in the book’s example) for a specific task, such as emotion classification.

  • Key Steps & Concepts:

    1. Baseline: It's good practice to establish a baseline with a simpler model (e.g., logistic regression with bag-of-words features for text classification) to gauge whether the complex LLM approach provides significant benefits.
    2. Data Formatting:
      • For emotion generation (LLM outputs the emotion word): Convert examples into a “task description + solution” format. E.g., Input: "Predict emotion: I feel very happy\nEmotion:", Target: "joy [EOS]".
      • The labels tensor for training masks out the input part (e.g., by using -100) so the loss is only computed on the target completion.
      • attention_mask is used to tell the model which tokens are real and which are padding.
    3. Model Loading: Using libraries like Hugging Face Transformers (AutoModelForCausalLM, AutoTokenizer). Setting tokenizer.pad_token = tokenizer.eos_token if the model doesn’t have a dedicated pad token.
    4. Finetuning to Follow Instructions (General Case):
      • Requires a prompting format/style (e.g., Vicuna, Alpaca, ChatML). This defines how instructions and solutions are structured. Consistency with this format is important during inference.
      • The dataset consists of (instruction, solution) pairs.
      • The book mentions generating an emotion (like “joy”) as text output. This is a form of instruction following.
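The label-masking idea from step 2 can be sketched in a few lines. The value -100 is PyTorch's default `ignore_index` for cross-entropy loss, so masked positions contribute nothing to the gradient; the token ids below are made up for illustration.

```python
def build_labels(input_ids, prompt_len, pad_id):
    """Copy input_ids, replacing prompt and padding positions
    with -100 so the loss covers only the target completion."""
    labels = []
    for i, tok in enumerate(input_ids):
        if i < prompt_len or tok == pad_id:
            labels.append(-100)  # ignored by cross-entropy
        else:
            labels.append(tok)
    return labels

def build_attention_mask(input_ids, pad_id):
    """1 for real tokens, 0 for padding."""
    return [0 if tok == pad_id else 1 for tok in input_ids]

# Toy example: 4 prompt tokens, 2 target tokens, 2 padding tokens
ids = [11, 12, 13, 14, 7, 8, 0, 0]
print(build_labels(ids, prompt_len=4, pad_id=0))
# → [-100, -100, -100, -100, 7, 8, -100, -100]
print(build_attention_mask(ids, pad_id=0))
# → [1, 1, 1, 1, 1, 1, 0, 0]
```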

5.4 Sampling From Language Models

  • What it’s ultimately trying to achieve: To control the way tokens are selected from the model’s output probability distribution, balancing creativity and coherence. Greedy decoding (always picking the most probable token) can be repetitive or dull.

  • Techniques:

    1. Basic Sampling with Temperature:
      • Softmax output probabilities are adjusted by a temperature T.
      • T > 1: Flatter distribution, more randomness (creative).
      • T < 1: Sharper distribution, less randomness (focused).
      • T = 1: Standard softmax.
    2. Top-k Sampling: Consider only the k most probable tokens and renormalize their probabilities before sampling.
    3. Top-p (Nucleus) Sampling: Consider the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9) and renormalize before sampling. This is adaptive: if the model is very confident (one token has high probability), p might be met by just a few tokens. If uncertain, many tokens might be included.
    4. Penalties:
      • Frequency Penalty: Reduces the probability of tokens that have already appeared frequently in the generated text.
      • Presence Penalty: Reduces the probability of tokens that have appeared at all, encouraging new topics.
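Temperature, top-k, and top-p can all be combined into one sampling step. A minimal NumPy sketch, assuming raw logits as input (`sample` is a hypothetical helper; production code would use a library's generation utilities):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id from raw logits: apply temperature, then
    optional top-k and top-p (nucleus) filtering, then renormalize."""
    rng = rng or np.random.default_rng(0)
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # most probable first
    keep = len(order)
    if top_k is not None:
        keep = min(keep, top_k)
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # smallest prefix whose cumulative probability reaches top_p
        keep = min(keep, int(np.searchsorted(cumulative, top_p) + 1))
    kept = order[:keep]
    renormed = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=renormed))

logits = np.array([2.0, 1.0, 0.1, -1.0])
print(sample(logits, top_k=1))  # → 0 (greedy: only the argmax survives)
```

Note the adaptive behavior of top-p: with a peaked distribution the nucleus may contain a single token, while a flat distribution keeps many candidates.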

5.5 Low-Rank Adaptation (LoRA)

  • What it’s ultimately trying to achieve: To significantly reduce the computational cost and memory requirements of finetuning large LLMs, making it accessible to users with limited resources. This is a type of Parameter-Efficient Finetuning (PEFT).

  • The Core Idea (Small Changes, Big Impact): Instead of updating all the billions of parameters in an LLM, LoRA freezes the original pretrained weights and introduces a small number of new, trainable “adapter” matrices.

    • For a large weight matrix W_0 (e.g., in an attention or MLP layer), LoRA adds two much smaller matrices, A (shape d x r) and B (shape r x k), where r (the rank) is small (e.g., 8, 16).
    • During finetuning, only A and B are trained.
    • The effective weight matrix becomes W = W_0 + (alpha/r) * A @ B, where alpha is a scaling hyperparameter. (Some presentations swap the roles of A and B and write the update as BA; either way, the key idea is that the update is the product of two much smaller matrices.)
    • The Hugging Face PEFT library (LoraConfig, get_peft_model) simplifies applying LoRA. You specify which layers/modules to adapt (e.g., query, key, value projections in attention).
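A minimal NumPy sketch of the LoRA forward pass, using the A (d × r), B (r × k) convention above. Initializing B to zero means training starts exactly at the pretrained behavior (the shapes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 8, 16

W0 = rng.normal(size=(d, k))        # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01  # trainable, small random init
B = np.zeros((r, k))                # trainable, zero init => update is 0 at start

def lora_forward(x):
    # Equivalent to x @ (W0 + (alpha / r) * A @ B), but the low-rank
    # path adds only (d + k) * r trainable parameters instead of d * k.
    return x @ W0 + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(1, d))
assert np.allclose(lora_forward(x), x @ W0)  # identical to base model at init
```

Here the adapters add 2 × 64 × 8 = 1,024 trainable parameters against the frozen matrix's 4,096, and the saving grows with matrix size since the adapter cost scales with (d + k) · r rather than d · k.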

5.6 LLM as a Classifier (Alternative to Generation)

  • What it’s ultimately trying to achieve: To use an LLM for traditional classification tasks by having it output logits for predefined classes, rather than generating class names as text.

  • How it works:

    • Instead of AutoModelForCausalLM, use AutoModelForSequenceClassification.
    • This class typically adds a classification head (a linear layer + softmax) on top of the final hidden state of the LLM (often the embedding of the last token or a special [CLS] token).
    • This head is trained to map the LLM’s contextual representation to class probabilities.
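Conceptually, the classification head is just a linear map plus softmax over the final hidden state. A toy NumPy sketch (the hidden size, class count, and weights are made up; in practice `AutoModelForSequenceClassification` wires this up for you):

```python
import numpy as np

def classify(last_hidden, W_head, b_head):
    """Map the last token's hidden state to class probabilities
    via a linear head followed by softmax."""
    logits = last_hidden @ W_head + b_head
    exp = np.exp(logits - logits.max())  # stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16,))      # final hidden state of the last token
W_head = rng.normal(size=(16, 6))    # e.g., 6 emotion classes
probs = classify(hidden, W_head, np.zeros(6))
print(round(float(probs.sum()), 6))  # → 1.0
```

Only the head (and optionally the backbone) is trained with a standard cross-entropy loss over these class probabilities, rather than a next-token loss over text.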

5.7 Prompt Engineering

  • What it’s ultimately trying to achieve: To guide a finetuned chat LLM to produce desired outputs by carefully crafting the input prompt, without further changing the model’s weights.

  • Features of a Good Prompt:

    1. Situation: Context for the request.
    2. Role: Persona for the LLM to adopt.
    3. Task: Clear, specific instructions.
    4. Output Format: JSON, bullet points, etc.
    5. Constraints: Limitations, preferences.
    6. Quality Criteria: What makes a good response.
    7. Examples (Few-Shot Prompting / In-Context Learning): Provide input-output examples.
    8. Call to Action: Restate the task.
  • Followup Actions: Iterating with the LLM, asking for corrections, using different LLMs for review.

  • Code Generation: Using detailed docstrings and requirements.

  • Documentation Synchronization: Using LLMs to help keep documentation updated with code changes.
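The prompt components above can be assembled programmatically. A hypothetical helper (the section labels and layout are illustrative, not a standard; any consistent structure works):

```python
def build_prompt(situation, role, task, output_format, examples, call_to_action):
    """Assemble a prompt from the components listed above."""
    parts = [
        f"Situation: {situation}",
        f"Role: You are {role}.",
        f"Task: {task}",
        f"Output format: {output_format}",
    ]
    for question, answer in examples:  # few-shot examples
        parts.append(f"Example:\nInput: {question}\nOutput: {answer}")
    parts.append(call_to_action)
    return "\n\n".join(parts)

prompt = build_prompt(
    situation="We are labeling customer feedback.",
    role="an expert sentiment annotator",
    task="Classify the message as positive, negative, or neutral.",
    output_format="One lowercase word.",
    examples=[("I love it!", "positive")],
    call_to_action="Now classify the next message.",
)
print(prompt)
```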

5.8 Hallucinations

  • What it’s ultimately trying to achieve: To understand why LLMs sometimes generate plausible-sounding but factually incorrect or nonsensical information, and how to mitigate this.

  • Reasons:

    • Models optimize for next-token prediction (coherence) not factual accuracy.
    • Gaps in training data.
    • Low-quality or biased training data.
    • Error propagation in token-by-token generation.
  • Prevention/Mitigation:

    • Retrieval-Augmented Generation (RAG): Ground responses in externally retrieved, verified information. The LLM uses this retrieved context to formulate its answer.
    • Domain-Specific Pretraining/Finetuning: Further train on reliable, domain-specific data.
    • Multi-step verification workflows, human review.
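A toy sketch of the RAG idea: retrieve the most relevant document, then ground the prompt in it. Word-overlap retrieval here is a stand-in for real embedding-based search, and the documents are made up:

```python
def words(text):
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?!") for w in text.lower().split()}

def retrieve(query, documents):
    """Pick the document sharing the most words with the query."""
    q = words(query)
    return max(documents, key=lambda d: len(q & words(d)))

def build_rag_prompt(query, documents):
    context = retrieve(query, documents)
    return (f"Answer using only the context below.\n"
            f"Context: {context}\n"
            f"Question: {query}")

docs = ["CRISPR-Cas9 is a gene-editing tool.",
        "The Transformer was introduced in 2017."]
print(build_rag_prompt("What is CRISPR-Cas9?", docs))
```

By instructing the model to answer only from the retrieved context, the response is anchored to verified information instead of the model's parametric (and possibly wrong) memory.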
5.9 Legal and Ethical Considerations

  • What it’s ultimately trying to achieve: To highlight the complex legal and ethical issues surrounding LLMs.
  • Key Issues:
    1. Training Data: Use of copyrighted material in training datasets (fair use debates).
    2. Generated Content: Copyright status of AI-generated content, potential for reproducing copyrighted material.
    3. Open-Weight Models: Legal implications of sharing weights trained on copyrighted data.
    4. Broader Ethics:
      • Explainability: LLM explanations are post-hoc rationalizations, not true transparency into their decision-making process.
      • Bias: LLMs can absorb and amplify societal biases from training data.