
Chapter 1: Machine Learning Basics

Even seasoned practitioners benefit from a quick refresher on the fundamentals, especially as they pertain to language models.

  • AI and Machine Learning: We start with a brief history – from the early days of AI with concepts like the Perceptron and ELIZA, through AI winters, to the rise of modern ML with deep learning.
  • Model: At its heart, a model is a function y = f(x). We explore the simple linear model y = wx + b, understand parameters (the weight w and the bias b), and the crucial concept of a loss function (like Mean Squared Error for regression) to quantify error.
  • Four-Step ML Process: This is the core loop:
    1. Collect a dataset.
    2. Define the model’s structure.
    3. Define the loss function.
    4. Minimize the loss (often using derivatives).
  • Vector and Matrix: We then move to representing data and parameters using vectors (feature vectors, dot products, norms, cosine similarity) and matrices (matrix multiplication, transpose). This is vital for understanding how neural networks process data efficiently.
  • Neural Network: We introduce non-linearity with activation functions (ReLU, sigmoid, tanh). We look at feedforward neural networks (FNNs), multilayer perceptrons (MLPs), and how layers combine hierarchically.
  • Gradient Descent: Since analytical solutions for minimizing loss in complex NNs are often infeasible, we rely on gradient descent. I walk through an example of binary classification using logistic regression, introducing binary cross-entropy loss.
  • Automatic Differentiation (Autograd): Manually deriving gradients is impractical. Modern frameworks like PyTorch automate this with autograd. We see a practical PyTorch example, understanding the forward and backward passes, and the role of tensors.
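
The core loop above, including the hand-derived gradients that autograd later automates, fits in a few lines of plain Python. Here is a minimal sketch (toy data, no PyTorch) of fitting y = wx + b by minimizing Mean Squared Error with gradient descent:

```python
# Minimal gradient descent for a linear model y = w*x + b with MSE loss.
# Gradients are derived by hand; frameworks like PyTorch automate this step.

# Step 1: collect a dataset (here, points generated from y = 2x + 1)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Step 2: define the model's structure (two parameters, w and b)
w, b = 0.0, 0.0

# Steps 3 and 4: define the MSE loss and minimize it with gradient descent
lr = 0.01
n = len(xs)
for _ in range(5000):
    # dL/dw = (2/n) * sum((w*x + b - y) * x), dL/db = (2/n) * sum(w*x + b - y)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # close to w = 2, b = 1
```

Deriving grad_w and grad_b by hand, as here, is exactly what autograd spares us from in the PyTorch examples.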

Chapter 2: Language Modeling Basics

This is where we start tailoring our ML knowledge to text.

  • Bag of Words (BoW): One of the simplest ways to convert text to numbers for tasks like classification. We discuss corpus, vocabulary, tokenization (words and subwords), and vectorization (document-term matrix). For multi-class classification, we introduce the softmax activation and cross-entropy loss. A PyTorch example shows how to build a simple text classifier.
  • Word Embeddings: BoW has limitations (sparsity, no semantic understanding). Word embeddings (like those from word2vec’s skip-gram model) represent words as dense vectors where similar words have similar vectors. This allows for capturing semantic similarity and dimensionality reduction.
  • Byte-Pair Encoding (BPE): A common subword tokenization algorithm. It helps manage vocabulary size and handle out-of-vocabulary words by breaking words into smaller, frequently occurring units.
  • Language Model Definition: Formally, a language model predicts the next token in a sequence given previous tokens, P(next_token | context). We focus on autoregressive (or causal) language models.
  • Count-Based Language Models: Before neural nets, n-gram models were standard. We look at how they estimate probabilities (e.g., trigram probability P(w3 | w1, w2) based on counts). We discuss challenges like zero probabilities for unseen n-grams and solutions like backoff and Laplace (add-one) smoothing.
  • Evaluating Language Models:
    • Perplexity: A core metric. Lower is better, indicating the model is less “surprised” by the test data. It is the exponentiated average negative log-likelihood.
    • ROUGE: For evaluating models on tasks like summarization, comparing model output to reference (ground truth) texts. We touch on ROUGE-N and ROUGE-L.
    • Human Evaluation: Essential for qualities automated metrics miss. We discuss Likert scales and pairwise comparisons with Elo ratings.
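
The count-based modeling and evaluation ideas above combine neatly in a toy example. The following is a minimal sketch (plain Python, an invented nine-word corpus) of a bigram model with Laplace smoothing, scored by perplexity:

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
V = len(vocab)

# Count unigrams and bigrams from the training corpus
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """Laplace (add-one) smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def perplexity(tokens):
    """Exponentiated average negative log-likelihood of the sequence."""
    nll = -sum(math.log(p(w, prev)) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(nll / (len(tokens) - 1))

print(perplexity("the cat sat".split()))
```

Add-one smoothing is what keeps p from returning zero for bigrams never seen in training, which would otherwise make the log-likelihood undefined.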

Chapter 3: Recurrent Neural Network (RNN)

RNNs were a breakthrough for sequential data like text.

  • Elman RNN: The basic RNN structure, where the output at a given step depends on the current input and the hidden state from the previous step. This “memory” allows RNNs to capture sequential dependencies.
  • Mini-Batch Gradient Descent: A practical necessity for training large models, processing data in small batches.
  • Programming an RNN: We build an Elman RNN from scratch in PyTorch, understanding how hidden states are passed and updated.
  • RNN as a Language Model: How to use the RNN architecture to predict the next token.
  • Embedding Layer: PyTorch’s nn.Embedding layer, which is essentially a lookup table for token embeddings, often trained as part of the model.
  • Dataset and DataLoader: PyTorch utilities for efficiently loading and batching data for training.
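
The Elman update itself is compact. Here is a minimal sketch in plain Python (tiny dimensions, hand-picked weights; the chapter builds the real thing in PyTorch):

```python
import math

def elman_step(x, h_prev, W_x, W_h, b):
    """One Elman RNN step: h_t = tanh(W_x @ x_t + W_h @ h_prev + b)."""
    return [
        math.tanh(
            sum(wx * xi for wx, xi in zip(W_x[i], x))
            + sum(wh * hj for wh, hj in zip(W_h[i], h_prev))
            + b[i]
        )
        for i in range(len(b))
    ]

# Tiny example: 2-dim input, 2-dim hidden state, hand-picked weights
W_x = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

h = [0.0, 0.0]  # initial hidden state
for x in [[1.0, 0.0], [0.0, 1.0]]:  # a 2-step input sequence
    h = elman_step(x, h, W_x, W_h, b)
print(h)
```

The same h is fed back in at every step; that recurrence is the “memory” that lets the network capture sequential dependencies.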

Chapter 4: Transformer

This is the architecture that powers modern LLMs.

  • Decoder-Only Architecture: We focus on this variant, common for autoregressive LMs.
  • Key Innovations:
    • Self-Attention: Allows the model to weigh the importance of different tokens in the input sequence when processing a particular token. We cover query, key, and value matrices.
    • Positional Encoding: Since transformers process tokens in parallel (unlike RNNs), they need a way to incorporate word order. Rotary Position Embedding (RoPE) is a key technique here.
  • Decoder Block Components:
    • Masked Multi-Head Self-Attention (with RoPE)
    • Position-Wise Multilayer Perceptron (MLP)
    • Residual Connections (Skip Connections): Crucial for training deep networks by mitigating the vanishing gradient problem.
    • Layer Normalization (specifically, RMSNorm): Stabilizes training.
  • Key-Value Caching: An optimization for faster inference by caching past key and value states.
  • Python Implementation: We build a decoder-only Transformer in PyTorch, piece by piece.
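
The heart of the decoder block, masked scaled dot-product self-attention, can be sketched in plain Python. This is a single head with no learned projections, using Q = K = V purely for brevity; the chapter’s PyTorch version adds learned weight matrices, multiple heads, and RoPE:

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product self-attention with a causal mask.

    Each position attends only to itself and earlier positions;
    weights come from softmax(q . k / sqrt(d)).
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Scores against keys up to position i only (the causal mask)
        scores = [sum(qj * kj for qj, kj in zip(q, K[t])) / math.sqrt(d)
                  for t in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of value vectors
        out.append([sum(w * V[t][j] for t, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out

# Toy sequence of three 2-dim token vectors, with Q = K = V for brevity
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(X, X, X)
print(out[0])  # the first token can only attend to itself
```

Because of the mask, position 0 attends only to itself, so its output equals its own value vector; later positions blend earlier ones in.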

Chapter 5: Large Language Model (LLM)

Here, we scale up and look at practical applications.

  • Why Larger is Better: Scale – more parameters, larger context windows, vast training data, and massive compute – gives rise to emergent capabilities.
  • Supervised Finetuning (SFT): Pretrained LLMs predict the next token. SFT trains them to follow instructions, answer questions, or engage in dialogue using high-quality instruction-response pairs. We compare a base pretrained model with its instruction-tuned version.
  • Finetuning a Pretrained Model: Practical steps using a model like GPT-2. We cover formatting data for tasks like emotion generation (text-to-label) or instruction following (using formats like ChatML).
  • Sampling Strategies: Beyond greedy decoding.
    • Temperature: Controls randomness.
    • Top-k sampling: Limits selection to k most probable tokens.
    • Top-p (Nucleus) sampling: Selects from the smallest set of tokens whose cumulative probability exceeds p.
    • Penalties: Frequency and presence penalties to discourage repetition.
  • Low-Rank Adaptation (LoRA): A parameter-efficient finetuning (PEFT) technique. Instead of finetuning all model weights, LoRA adds and trains small “adapter” matrices, significantly reducing computational cost.
  • LLM as a Classifier: Attaching a classification head to an LLM.
  • Prompt Engineering: Crafting effective prompts (situation, role, task, format, constraints, examples, call to action). Discussing few-shot prompting and follow-up strategies.
  • Hallucinations: Why they happen (models optimize for coherence, not truth) and how to mitigate them (e.g., Retrieval-Augmented Generation - RAG, domain-specific pretraining).
  • Copyright and Ethics: Critical considerations around training data and generated content.
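
The sampling strategies listed above compose naturally: scale the logits by the temperature, then restrict sampling to the nucleus. A minimal sketch in plain Python (toy four-token vocabulary; real implementations operate on full logit tensors):

```python
import math
import random

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Temperature-scaled softmax followed by nucleus (top-p) sampling.

    Keeps the smallest set of tokens whose cumulative probability
    reaches top_p, renormalizes, and samples a token index from it.
    """
    # Temperature scaling, then a numerically stable softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort tokens by probability and keep the nucleus
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the nucleus and sample a token index from it
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

# Toy logits for a 4-token vocabulary; a low temperature sharpens the
# distribution so much that the nucleus collapses to the single top token.
print(sample_top_p([2.0, 1.0, 0.1, -1.0], temperature=0.5, top_p=0.8))
```

With these particular logits and temperature, the top token alone already exceeds p = 0.8, so sampling becomes deterministic; raising the temperature or p widens the nucleus and restores randomness.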

Chapter 6: Further Reading

Pointers to more advanced topics for continued learning.

  • Mixture of Experts (MoE): Increases model capacity efficiently by routing tokens to specialized sub-networks.
  • Model Merging: Combining multiple pretrained models.
  • Model Compression: Techniques like quantization (e.g., QLoRA) and pruning.
  • Preference-Based Alignment: Methods like RLHF (Reinforcement Learning from Human Feedback) and Constitutional AI to align LLMs with user values.
  • Advanced Reasoning: Chain-of-Thought (CoT), Tree of Thought (ToT), ReAct.
  • Security: Jailbreak attacks, prompt injection.
  • Vision Language Models (VLMs): Models that understand both text and images.
  • Preventing Overfitting: Regularization, dropout, early stopping, validation sets.