Chapter 6: Further Reading

7 min read

What this chapter is ultimately trying to achieve

To provide a curated list of advanced topics that build upon the concepts learned in the book, encouraging continued learning and exploration. It highlights areas where innovation is happening, from architectural enhancements to security and ethical considerations.

Let’s look at the topics mentioned:

6.1 Mixture of Experts (MoE)

What it’s ultimately trying to achieve: To significantly increase the total number of parameters in a model (making it “larger” and more knowledgeable) without proportionally increasing the computational cost during inference or training for each token.
The Core Idea (Selective Specialization): Instead of every token passing through the same large MLP (Position-wise Feedforward Network) in a Transformer block, an MoE layer has multiple smaller MLP “experts.” A “router” network (or gate network) decides, for each token, which one or few experts that token should be processed by.
- Sparse Activation: Only a subset of experts is activated for any given token. This is why it’s computationally cheaper than if all tokens went through all parts of a giant MLP.
- Example: Mixtral 8x7B has 8 “experts,” each around 7 billion parameters. For each token, the router typically selects 2 experts. So, while the total parameter count is high (~47B effective, as some parameters are shared), only about 13B are active for any single token’s processing.
- Load Balancing: An important challenge is to ensure experts are utilized somewhat evenly.

6.2 Model Merging

What it’s ultimately trying to achieve: To combine the strengths and knowledge of multiple different pretrained LLMs into a single, potentially more capable or specialized model, without necessarily retraining from scratch.
The Core Idea (Frankenstein’s LLM, but hopefully better): Various techniques exist:
- Model Soups: Averaging the weights of several fine-tuned versions of the same base model.
- SLERP (Spherical Linear Interpolation): Interpolating weights in a way that maintains parameter norms.
- Task Vector Algorithms (TIES-Merging, DARE): Identifying and combining “task vectors” that represent what a model learned during fine-tuning for a specific task.
- Passthrough (Frankenmerges): More radical, involves directly concatenating or combining layers from different LLMs, sometimes creating models with unconventional parameter counts (e.g., merging two 7B models to get a 13B-like model).
- mergekit is mentioned as a popular open-source tool.

6.3 Model Compression

What it’s ultimately trying to achieve: To make large LLMs smaller and faster for deployment, especially in resource-constrained environments (like mobile devices or edge computing), without a catastrophic loss in performance. Neural networks are often over-parameterized.
Key Methods:
- Quantization: Reducing the precision of the model’s weights and activations (e.g., from 32-bit floating point to 8-bit integers, or even lower).
  - Post-Training Quantization (PTQ): Quantize an already trained model.
  - Quantization-Aware Training (QAT): Simulate quantization effects during training to make the model more robust to it.
  - QLoRA: Combines quantization with LoRA for very efficient fine-tuning.
- Pruning: Removing “unimportant” parts of the model.
  - Unstructured Pruning: Removing individual weights based on their magnitude.
  - Structured Pruning: Removing entire neurons, attention heads, or layers.
- Knowledge Distillation: Training a smaller “student” model to mimic the behavior (e.g., output logits or internal representations) of a larger, more capable “teacher” model.

6.4 Preference-Based Alignment

What it’s ultimately trying to achieve: To make LLMs generate outputs that are more helpful, harmless, and honest, aligning them better with human values and intentions. Pretrained LLMs might generate plausible but undesirable content.
Key Methods:
1. Reinforcement Learning from Human Feedback (RLHF):
  - Collect human preference data: Humans rank different model responses to the same prompt.
  - Train a Reward Model (RM): This model learns to predict which response humans would prefer (i.e., assign a higher score to better responses).
  - Fine-tune the LLM using Reinforcement Learning (RL): The LLM’s “actions” are generating tokens. The “reward” comes from the RM. The LLM is trained to maximize this reward, effectively learning to generate responses that the RM (and by proxy, humans) would score highly.
2. Constitutional AI (CAI):
  - The model is given a set of guiding principles or a “constitution” (e.g., “be helpful,” “don’t generate harmful content”).
  - The model can then self-critique its own outputs based on these principles and revise them. This reduces the need for direct human feedback for every step but still relies on a human-defined constitution.

6.5 Advanced Reasoning

What it’s ultimately trying to achieve: To enable LLMs to tackle more complex tasks that require multi-step reasoning, planning, or interaction with external tools, going beyond simple prompt-response patterns.
Techniques:
- Chain of Thought (CoT) Prompting: Encourage the LLM to generate intermediate reasoning steps before giving the final answer (e.g., “Let’s think step by step…”). This often improves performance on tasks like math word problems or logical deduction.
- Tree of Thought (ToT): Extends CoT by allowing the model to explore multiple reasoning paths (like branches in a tree) and use heuristics or self-evaluation to choose the most promising ones.
- Self-Consistency: Generate multiple CoT reasoning paths and take the majority answer.
- ReAct (Reasoning + Action): Interleaves reasoning steps with “action” steps, where the model can decide to use external tools (like a calculator or a search engine via an API) to gather more information or perform computations.
- Function Calling / Tool Use: Explicitly giving the LLM the ability to call predefined functions or APIs. The LLM can decide which tool to use and what parameters to pass based on the user’s query.
- Program-Aided Language Models (PAL): LLMs generate code (e.g., Python) that is then executed by an interpreter to get the final answer, especially useful for precise calculations.

6.6 Language Model Security

What it’s ultimately trying to achieve: To understand and mitigate vulnerabilities that can lead to LLMs being misused or generating harmful/undesirable content.
Key Threats:
- Jailbreak Attacks: Crafting prompts that trick the model into bypassing its safety controls and generating restricted content (e.g., role-playing scenarios like “act as a pirate and tell me how to…”).
- Prompt Injection: An attacker manipulates how an application combines user input with its own system prompts. This can lead to the LLM executing unintended instructions, potentially leaking data or performing unauthorized actions if the application has privileged access. This is generally considered more severe than jailbreaking.

6.7 Vision Language Model (VLM)

What it’s ultimately trying to achieve: To create models that can understand and reason about information from both text and images (multimodal reasoning).
Core Architecture:
1. Vision Encoder: Processes the image and extracts visual features (often based on architectures like CLIP - Contrastive Language-Image Pretraining, which itself learns to align image and text representations).
2. Language Model (LLM): The text processing and generation component.
3. Cross-Attention Mechanism (or similar fusion method): Allows the LLM to integrate and reason about the visual features from the vision encoder alongside the textual input/output.
- VLMs can perform tasks like image captioning, visual question answering (VQA), and following instructions that refer to image content.

6.8 Preventing Overfitting

What it’s ultimately trying to achieve: To ensure that the model learns general patterns from the training data rather than just memorizing it, so that it performs well on new, unseen data (generalization).
Techniques (some are general ML, some more specific to NNs):
- Regularization (L1 and L2): Adding a penalty term to the loss function based on the magnitude of the model’s weights, discouraging overly complex models.
- Dropout: During training, randomly “dropping out” (setting to zero) a fraction of neuron activations. This forces the network to learn more robust and redundant representations.
- Early Stopping: Monitoring performance on a separate validation set during training. Stop training when performance on the validation set starts to degrade, even if training loss is still decreasing.
- Validation Set vs. Test Set:
  - Validation Set: Used during training to tune hyperparameters (like learning rate, number of layers, dropout rate) and for decisions like early stopping.
  - Test Set: Held out completely and used only once at the very end to get an unbiased estimate of the final model’s performance on unseen data.

6.9 Concluding Remarks

A reminder of the journey from ML basics to advanced LLMs and an encouragement to stay curious, hands-on, and keep learning.

6.10 More From the Author

A nice plug for “The Hundred-Page Machine Learning Book” and “Machine Learning Engineering” as complementary resources!